Main Question
Our main question:
- How does the inclusion of different numbers of age groups influence the results of an analysis of right to carry laws and violence rates?
Disclaimer: The purpose of the Open Case Studies project is to demonstrate the use of various data science methods, tools, and software in the context of messy, real-world data. A given case study does not cover all aspects of the research process, is not claiming to be the most appropriate way to analyze a given data set, and should not be used in the context of making policy decisions without external consultation from scientific experts.
This work is licensed under the Creative Commons Attribution-NonCommercial 3.0 (CC BY-NC 3.0) United States License.
This case study will introduce the topic of multicollinearity. We will do so by showcasing a real-world example where multicollinearity in part resulted in historically controversial and conflicting findings about the influence of the adoption of right-to-carry (RTC) concealed handgun laws on violent crime rates in the United States.
We will focus on two articles:
This has been a controversial topic as many other articles also had conflicting results. See here for a list of studies.
The Donohue, et al. article discusses how there are many other important methodological aspects besides multicollinearity that could account for the historically conflicting results in these previous papers.
In fact, nearly every aspect of the data analysis process was different between the Donohue, et al. analysis and the Lott and Mustard analysis.
However, we will focus particularly on multicollinearity, and we will explore how it can influence linear regression analyses and result in different conclusions.
This analysis will demonstrate how methodological details can be critically influential for our overall conclusions and can result in important policy-related consequences. This article will provide a basis for the motivation.
John J. Donohue et al., Right-to-Carry Laws and Violent Crime: A Comprehensive Assessment Using Panel Data and a State-Level Synthetic Control Analysis. Journal of Empirical Legal Studies, 16(2), (2019).
David B. Mustard & John Lott. Crime, Deterrence, and Right-to-Carry Concealed Handguns. Coase-Sandor Institute for Law & Economics Working Paper No. 41, (1996).
Here you can see the differences in the data used in the featured RTC articles:
We will perform analyses similar to those in these articles; however, we will not try to recreate them. Instead, we will simplify our analysis to allow us to focus on multicollinearity.
Therefore, we will use a subset of the listed explanatory variables, and they will be consistent for both analyses that we perform, with one exception: one analysis will have 6 demographic variables, like the analysis in the Donohue, et al. article, and the other will have 36 demographic variables, like the analysis in the Lott and Mustard article.
Our main question: How does the inclusion of different numbers of age groups influence the results of an analysis of right-to-carry laws and violence rates?
Statistical Learning Objectives:
Data Science Learning Objectives:
We will use several packages: readr, readxl, and pdftools to import data; dplyr to wrangle data; stringr to work with strings; dplyr and janitor to clean data; tidyr to reshape data; and ggplot2 to visualize data. We will especially focus on using packages and functions from the Tidyverse, such as dplyr and ggplot2. The Tidyverse is a collection of packages created by RStudio. While some students may be familiar with older R programming packages, these packages make data science in R especially efficient.
So what exactly is a right-to-carry law?
It is a law that specifies if and how citizens are allowed to have a firearm on their person or nearby (for example, in the citizen's car) in public.
The Second Amendment to the United States Constitution guarantees the right to “keep and bear arms”. The amendment was ratified in 1791 as part of the Bill of Rights.
However, there are no federal laws about carrying firearms in public.
These laws are created and enforced at the state level. States vary greatly in their laws about the right to carry firearms. Some require extensive effort to obtain a permit to legally carry a firearm, while other states require very minimal effort to legally carry a firearm.
According to Wikipedia about the history of right-to-carry policies in the United States:
Public perception on concealed carry vs open carry has largely flipped. In the early days of the United States, open carrying of firearms, long guns and revolvers was a common and well-accepted practice. Seeing guns carried openly was not considered to be any cause for alarm. Therefore, anyone who would carry a firearm but attempt to conceal it was considered to have something to hide, and presumed to be a criminal. For this reason, concealed carry was denounced as a detestable practice in the early days of the United States.
Concealed weapons bans were passed in Kentucky and Louisiana in 1813. (In those days open carry of weapons for self-defense was considered acceptable; concealed carry was denounced as the practice of criminals.) By 1859, Indiana, Tennessee, Virginia, Alabama, and Ohio had followed suit. By the end of the nineteenth century, similar laws were passed in places such as Texas, Florida, and Oklahoma, which protected some gun rights in their state constitutions. Before the mid 1900s, most U.S. states had passed concealed carry laws rather than banning weapons completely. Until the late 1990s, many Southern states were either “No-Issue” or “Restrictive May-Issue”. Since then, these states have largely enacted “Shall-Issue” licensing laws, with numerous states legalizing “Unrestricted concealed carry”.
See here for more information.
Here are the general categories of Right to Carry Laws:
You can see that none of the fifty states currently have no-issue laws, meaning that all states allow the right to carry firearms in at least some way; however, the level of restriction differs dramatically from one state to another.
Here you can see how these laws have changed over time around the country:
There is variation from state to state even within the same general category:
For example here are the current carry laws in Idaho which is considered an “Unrestricted - no permit required” state:
Idaho permits the open carrying of firearms.
Idaho law permits both residents and non-residents who are at least 18 years old to carry concealed weapons, without a carry license, outside the limits of or confines of any city, provided the person is not otherwise disqualified from being issued a license to carry.
A person may also carry concealed weapons on or about his or her person, without a license, in the person’s own place of abode or fixed place of business, on property in which the person has any ownership or leasehold interest, or on private property where the person has permission to carry from any person who has an ownership or leasehold interest in that property.
State law also allows any resident of Idaho or a current member of the armed forces of the United States to carry a concealed handgun without a license to carry, provided the person is over 18 years old and not disqualified from being issued a license to carry concealed weapons under state law. An amendment to state law that takes effect on July 1, 2020 changes the reference in the above law from “a resident of Idaho” to “any citizen of the United States.”
And here are the current carry laws in Arizona, which is also considered an "Unrestricted - no permit required" state:
Arizona respects the right of law abiding citizens to openly carry a handgun.
Any person 21 years of age or older, who is not a prohibited possessor, may carry a weapon openly or concealed without the need for a license. Any person carrying without a license must acknowledge and comply with the demands of a law enforcement officer when asked if he/she is carrying a concealed deadly weapon, if the officer has initiated an "investigation" such as a traffic stop.
Notice that citizens in Idaho only need to be 18 to carry a firearm, whereas they must be 21 in Arizona.
In contrast here is an example of current carry laws in Maryland which is considered a “Rights Restricted-Very Limited Issue” state:
Carrying and Transportation in Vehicles It is unlawful for any person without a permit to wear or carry a handgun, openly or concealed, upon or about his person. It is also unlawful for any person to knowingly transport a handgun in any vehicle traveling on public roads, highways, waterways or airways, or upon roads or parking lots generally used by the public. This does not apply to any person wearing, carrying or transporting a handgun within the confines of real estate owned or leased by him, or on which he resides, or within the confines of a business establishment owned or leased by him.
Permit To Carry Application for a permit to carry a handgun is made to the Secretary of State Police. In addition to the printed application form, the applicant should submit a notarized letter stating the reasons why he is applying for a permit.
There are some important considerations regarding this data analysis to keep in mind:
We do not use all of the data used by either the Lott and Mustard or Donohue, et al. analyses, nor do we perform the same analysis as either article. We instead perform a much simpler analysis with fewer variables to illustrate the concept of multicollinearity and its influence on regression coefficients, not to reproduce either analysis.
Because our analysis is an oversimplification, it should not be used for determining policy changes; instead, we suggest that users consult with a specialist.
We would also like to note that it is important that we do not treat race as an objective measure. Despite this, it can be used to advance scientific inquiry. For more information on this topic, we have included a link to a paper on the use of race as a measure in epidemiology.
We will begin by loading the packages that we will need:
library(here)
library(readxl)
library(readr)
library(pdftools)
library(dplyr)
library(magrittr)
library(tidyr)
library(stringr)
library(purrr)
library(forcats)
library(tibble)
library(car) # vif function
library(plm) # fixed effect model, linear regression
library(broom) # tidy output
library(tidyverse) # general wrangling functions
library(cowplot) # to produce plot of plots
library(GGally)
library(ggrepel)
library(scales)
library(latex2exp)
library(viridis)
library(ggcorrplot)
library(rsample)
set.seed(999)
| Package | Use |
|---|---|
| here | to easily load and save data |
| readr | to import the CSV file data |
| car | to calculate VIF values |
| purrr | to combine multiple tibbles within a list of tibbles |
| forcats | to collapse levels of factors into more summarised versions |
The first time we use a function, we will use :: to indicate which package it comes from. Unless we have overlapping function names, this is not necessary, but we will include it here to be informative about where the functions we use come from.
Below is a table from the Donohue, et al. paper that shows the data used in both analyses, where DAW stands for Donohue, et al. and LM stands for Lott and Mustard.
We will be using a subset of these variables, which are highlighted in green:
To obtain information about age, sex, race, and overall population, we will use US Census Bureau data, just like both of the articles. The census data is available for different time spans. Here are the links for the years used in our analysis. We will use data from 1977 to 2010.
| Data | Link |
|---|---|
| years 1977 to 1979 | link |
| years 1980 to 1989 | link (county data was used for this decade, which also has state information) |
| years 1990 to 1999 | link |
| years 2000 to 2010 | link technical documentation |
To import the data, we will use the read_csv() function of the readr package for the CSV files. In some decades there are separate files for each year; we will read these together using the base list.files() function to get the names of all the files, and then the map() function of the purrr package to apply read_csv() to all of the file paths in the list created by list.files(). For years that are txt files, we will use read_table2(), also of the readr package. The read_table2() function, unlike read_table(), allows for any number of whitespace characters between columns, and the lines can be of different lengths.
dem_77_79 <- read_csv("docs/Demographics/Decade_1970/pe-19.csv", skip = 5)
dem_80_89 <- list.files(recursive = TRUE,
path = "docs/Demographics/Decade_1980/",
pattern = "*.csv",
full.names = TRUE) %>%
map(~read_csv(., skip=5))
dem_90_99 <- list.files(recursive = TRUE,
path = "docs/Demographics/Decade_1990/",
pattern = "*.txt",
full.names = TRUE) %>%
map(~read_table2(., skip = 14))
dem_00_10_2 <- read_csv("docs/Demographics/Decade_2000/st-est00int-alldata.csv")
dem_00_10 <- list.files(recursive = TRUE,
path = "docs/Demographics/Decade_2000/",
pattern = "*.csv",
full.names = TRUE) %>%
map(~read_csv(.))
head(dem_00_10)[[1]]
# A tibble: 62,244 x 21
REGION DIVISION STATE NAME SEX ORIGIN RACE AGEGRP ESTIMATESBASE20…
<dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0 0 0 Unit… 0 0 0 0 281424600
2 0 0 0 Unit… 0 0 0 1 19176154
3 0 0 0 Unit… 0 0 0 2 20549855
4 0 0 0 Unit… 0 0 0 3 20528425
5 0 0 0 Unit… 0 0 0 4 20218782
6 0 0 0 Unit… 0 0 0 5 18962964
7 0 0 0 Unit… 0 0 0 6 19381792
8 0 0 0 Unit… 0 0 0 7 20511067
9 0 0 0 Unit… 0 0 0 8 22707390
10 0 0 0 Unit… 0 0 0 9 22442442
# … with 62,234 more rows, and 12 more variables: POPESTIMATE2000 <dbl>,
# POPESTIMATE2001 <dbl>, POPESTIMATE2002 <dbl>, POPESTIMATE2003 <dbl>,
# POPESTIMATE2004 <dbl>, POPESTIMATE2005 <dbl>, POPESTIMATE2006 <dbl>,
# POPESTIMATE2007 <dbl>, POPESTIMATE2008 <dbl>, POPESTIMATE2009 <dbl>,
# CENSUS2010POP <dbl>, POPESTIMATE2010 <dbl>
Notice that the STATE variable for the demographic data is numeric. That is because it is encoded by Federal Information Processing Standard (FIPS) state codes. Thus we also need to import data about FIPS encoding so that we can identify which data corresponds to which state.
The following data was downloaded from the US Census Bureau.
To import the data we will use the read_xls() function of the readxl package. Since the first five lines of this Excel file contain information about the source of the data and when it was released, we need to skip importing these lines using the skip argument so that the data has the same number of columns for each row.
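The import looks something like this sketch (the exact file path is an assumption, as it is not shown here):
STATE_FIPS <- read_xls("docs/FIPS/state-geocodes.xls", skip = 5)   # hypothetical path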
# A tibble: 64 x 4
Region Division `State\n(FIPS)` Name
<chr> <chr> <chr> <chr>
1 1 0 00 Northeast Region
2 1 1 00 New England Division
3 1 1 09 Connecticut
4 1 1 23 Maine
5 1 1 25 Massachusetts
6 1 1 33 New Hampshire
7 1 1 44 Rhode Island
8 1 1 50 Vermont
9 1 2 00 Middle Atlantic Division
10 1 2 34 New Jersey
# … with 54 more rows
The following data was downloaded from the Federal Bureau of Investigation.
The read_csv() function of the readr package guesses what the class is for each variable, but sometimes it makes mistakes. It is good to specify the class for variables if you know them. We know that we want the variables about male and female counts to be numeric. We can specify that using the col_types = argument. See here and here for more information.
ps_data <- read_csv("docs/Police_staffing/pe_1960_2018.csv")
ps_data <- read_csv("docs/Police_staffing/pe_1960_2018.csv",
col_types = cols(male_total_ct = "n",
female_total_ct = "n"))
ps_data <- read_csv("docs/Police_staffing/pe_1960_2018.csv",
col_types = cols(male_total_ct = col_double(),
female_total_ct = col_double()))
head(ps_data)
# A tibble: 6 x 21
data_year ori pub_agency_name pub_agency_unit state_abbr division_name
<dbl> <chr> <chr> <chr> <chr> <chr>
1 1960 AK02… Alcohol Bevera… <NA> AK Pacific
2 1960 AL00… Homewood <NA> AL East South C…
3 1960 AL01… Coffeeville <NA> AL East South C…
4 1960 AL01… Coffee <NA> AL East South C…
5 1960 AL02… Mentone <NA> AL East South C…
6 1960 AL03… Greensboro <NA> AL East South C…
# … with 15 more variables: region_name <chr>, county_name <chr>,
# agency_type_name <chr>, population_group_desc <chr>, population <dbl>,
# male_officer_ct <dbl>, male_civilian_ct <dbl>, male_total_ct <dbl>,
# female_officer_ct <lgl>, female_civilian_ct <lgl>, female_total_ct <dbl>,
# officer_ct <lgl>, civilian_ct <lgl>, total_pe_ct <lgl>,
# pe_ct_per_1000 <lgl>
The following data was downloaded from the U.S. Bureau of Labor Statistics.
There are Excel files for each state. As you can see, there are many rows to skip to make sure that there are the same number of columns for each row. We can also see that the state name is located in a couple of the first rows.
We can see this here if we just try to read in the files directly.
ue_rate_data <- list.files(recursive = TRUE,
path = "docs/Unemployment",
pattern = "*.xlsx",
full.names = TRUE) %>%
map(~read_xlsx(.))
head(ue_rate_data)[1][[1]]
# A tibble: 55 x 14
`Local Area Une… ...2 ...3 ...4 ...5 ...6 ...7 ...8 ...9 ...10 ...11
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Original Data V… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
2 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
3 Series Id: LAUS… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
4 Not Seasonally … <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
5 Area: Alab… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
6 Area Type: Stat… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
7 State/Region/Di… Alab… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
8 Measure: unem… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
9 Years: 1977… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
10 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
# … with 45 more rows, and 3 more variables: ...12 <chr>, ...13 <chr>,
# ...14 <chr>
So now we will skip the first 10 lines, and also create a names tibble that contains only the cell with the state information.
ue_rate_data <- list.files(recursive = TRUE,
path = "docs/Unemployment",
pattern = "*.xlsx",
full.names = TRUE) %>%
map(~read_xlsx(., skip = 10))
head(ue_rate_data[1])[[1]]
# A tibble: 44 x 14
Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1977 7.5 9 7.7 7.2 6.8 8.6 8 7.8 6.7 6.3 6.3 6
2 1978 7.1 6.9 6.2 5.4 5.1 6.9 6.7 6.7 6.5 6.3 6.3 6.5
3 1979 6.7 7.5 6.9 6.6 6.4 8.4 7.7 7.8 7.1 7.2 6.9 6.7
4 1980 7.7 7.8 7.4 7.4 8.4 9.7 10.4 10.3 9.3 9.6 9.4 9
5 1981 10 10.3 9.5 9.1 9.4 11.1 10.4 10.9 10.8 11.7 11.5 11.8
6 1982 13.2 13.2 12.9 12.6 12.8 14.5 14.7 14.8 14.7 15.1 15.4 15.3
7 1983 16 16 14.5 13.7 13.3 14.6 13.9 13.8 13.2 12.8 12.1 11.8
8 1984 12.5 12.4 11.4 10.8 10.1 11.3 11.5 11.3 10.8 10.2 9.7 10.1
9 1985 10.7 10.5 9.8 8.7 8.4 9.6 9.2 8.8 8.6 8.6 8.4 8.7
10 1986 9.3 10.4 10.1 9.4 9.4 10.5 9.7 9.6 9.7 9.7 9.6 9
# … with 34 more rows, and 1 more variable: Annual <dbl>
To get the state name for each file, we will use the map() function to apply functions across all of the files: we will import only a small range of cells using the range argument, grab the cell that has the state information based on its location within the range of cells imported using c(), and then use the base unlist() function to flatten the list that this creates.
ue_rate_names <- list.files(recursive = TRUE,
path = "docs/Unemployment",
pattern = "*.xlsx",
full.names = TRUE) %>%
map(~read_xlsx(., range = "B4:B6")) %>%
map(., c(1,2)) %>%
unlist()
ue_rate_names
[1] "Alabama" "Alaska" "Arizona"
[4] "Arkansas" "California" "Colorado"
[7] "Connecticut" "Delaware" "District of Columbia"
[10] "Florida" "Georgia" "Hawaii"
[13] "Idaho" "Illinois" "Indiana"
[16] "Iowa" "Kansas" "Kentucky"
[19] "Louisiana" "Maine" "Maryland"
[22] "Massachusetts" "Michigan" "Minnesota"
[25] "Mississippi" "Missouri" "Montana"
[28] "Nebraska" "Nevada" "New Hampshire"
[31] "New Jersey" "New Mexico" "New York"
[34] "North Carolina" "North Dakota" "Ohio"
[37] "Oklahoma" "Oregon" "Pennsylvania"
[40] "Rhode Island" "South Carolina" "South Dakota"
[43] "Tennessee" "Texas" "Utah"
[46] "Vermont" "Virginia" "Washington"
[49] "West Virginia" "Wisconsin" "Wyoming"
Now we will make these values the names of the different tibbles within ue_rate_data.
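A minimal sketch of this step, using the set_names() function of the purrr package (an assumption about the approach, since the original code is not shown):
ue_rate_data <- set_names(ue_rate_data, ue_rate_names)   # name each tibble by its state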
Extracted from Table 21 of the US Census Bureau Poverty Data.
# Note: reading this file has produced a persistent "unknown or uninitialised
# column" warning of unclear origin (not easily reproducible, even after
# restarting R); see
# https://community.rstudio.com/t/persistent-unknown-or-uninitialised-column-warnings/64879
# The suggested approach there is to initialize the column.
poverty_rate_data <- read_xls("docs/Poverty/hstpov21.xls", skip = 2)
head(poverty_rate_data)
# A tibble: 6 x 6
`NOTE: Number in thousa… ...2 ...3 ...4 ...5 ...6
<chr> <chr> <chr> <chr> <chr> <chr>
1 2018 <NA> <NA> <NA> <NA> <NA>
2 STATE Total Number "Standard\n… Percent "Standard\ner…
3 Alabama 4877 779 "65" 16 "1.3"
4 Alaska 720 94 "9" 13.1 "1.2"
5 Arizona 7241 929 "80" 12.80000000… "1.1000000000…
6 Arkansas 2912 462 "38" 15.9 "1.3"
We can see that this will require some wrangling to make the data more usable.
Violent crime data was obtained from here. This data is a bit trickier because of spaces and / in the column names; thus the read_lines() function of the readr package works better than the read_csv() function.
crime_data <- read_lines("docs/Crime/CrimeStatebyState.csv", skip = 2, skip_empty_rows = TRUE)
head(crime_data)
[1] "Estimated crime in Alabama"
[2] "\n,,National or state crime,,,,,,,"
[3] "\n,,Violent crime,,,,,,,"
[4] "\nYear,Population,Violent crime total,Murder and nonnegligent Manslaughter,Legacy rape /1,Revised rape /2,Robbery,Aggravated assault,"
[5] "1977, 3690000, 15293, 524, 929,, 3572, 10268 "
[6] "1978, 3742000, 15682, 499, 954,, 3708, 10521 "
We can see that this data will also require some wrangling to make it more usable.
This data is extracted from a table in the Donohue paper. We will use the pdf_text() function of the pdftools package to import the PDF document.
if(!file.exists(here("docs", "w23510.pdf"))){
url <- "https://www.nber.org/papers/w23510.pdf"
utils::download.file(url, here("docs", "w23510.pdf"))
}
DAWpaper <- pdf_text(here("docs", "w23510.pdf"))
head(DAWpaper[1])
[1] "                          NBER WORKING PAPER SERIES\n            RIGHT-TO-CARRY LAWS AND VIOLENT CRIME:\n     A COMPREHENSIVE ASSESSMENT USING PANEL DATA AND\n      A STATE-LEVEL SYNTHETIC CONTROL ANALYSIS\n                 John J. Donohue\n                  Abhay Aneja\n                  Kyle D. Weber\n               Working Paper 23510\n         http://www.nber.org/papers/w23510\n      NATIONAL BUREAU OF ECONOMIC RESEARCH\n          1050 Massachusetts Avenue\n            Cambridge, MA 02138\n       June 2017, Revised November 2018\nPreviously circulated as \"Right-to-Carry Laws and Violent Crime: A Comprehensive Assessment\nUsing Panel Data and a State-Level Synthetic Controls Analysis.\" We thank Dan Ho, Stefano\nDellaVigna, Rob Tibshirani, Trevor Hastie, StefanWager, Jeff Strnad, and participants at the\n2011 Conference of Empirical Legal Studies (CELS), 2012 American Law and Economics\nAssociation (ALEA) Annual Meeting, 2013 Canadian Law and Economics Association (CLEA)\nAnnual Meeting, 2015 NBER Summer Institute (Crime), and the Stanford Law School faculty\nworkshop for their comments and helpful suggestions. Financial support was provided by\nStanford Law School. We are indebted to Alberto Abadie, Alexis Diamond, and Jens\nHainmueller for their work developing the synthetic control algorithm and programming the Stata\nmodule used in this paper and for their helpful comments. The authors would also like to thank\nAlex Albright, Andrew Baker, Jacob Dorn, Bhargav Gopal, Crystal Huang, Mira Korb, Haksoo\nLee, Isaac Rabbani, Akshay Rao, Vikram Rao, Henrik Sachs and Sidharth Sah who provided\nexcellent research assistance, as well as Addis O’Connor and Alex Chekholko at the Research\nComputing division of Stanford’s Information Technology Services for their technical support.\nThe views expressed herein are those of the author and do not necessarily reflect the views of the\nNational Bureau of Economic Research.\nNBER working papers are circulated for discussion and comment purposes. They have not been\npeer-reviewed or been subject to the review by the NBER Board of Directors that accompanies\nofficial NBER publications.\n© 2017 by John J. Donohue, Abhay Aneja, and Kyle D. Weber. All rights reserved. Short\nsections of text, not to exceed two paragraphs, may be quoted without explicit permission\nprovided that full credit, including © notice, is given to the source.\n"
Again, this data will also require quite a bit of wrangling.
Let’s first take a look at our state FIPS data to see if it needs any cleaning or reshaping. We should start with this data because we will need to use it to wrangle some of the other data.
# A tibble: 6 x 4
Region Division `State\n(FIPS)` Name
<chr> <chr> <chr> <chr>
1 1 0 00 Northeast Region
2 1 1 00 New England Division
3 1 1 09 Connecticut
4 1 1 23 Maine
5 1 1 25 Massachusetts
6 1 1 33 New Hampshire
We only need the last two columns, but we might want to rename them. The Name variable is vague, and the variable with the FIPS code is called State\n(FIPS). To get rid of the new line in this variable name and to change the Name variable to something more informative, we will use the rename() function of the dplyr package. To use this function, we list the new name first, followed by =, and then the existing variable. We can rename multiple variables at the same time by using a comma to separate the variables we are renaming. We will use the select() function, also of the dplyr package, to keep just these variables, and we will filter out the rows with FIPS values of 00 with the filter() function, again also part of the dplyr package. We will specify that we want STATEFP values that are not equal to 00 by using this operator: !=. We will also use the compound assignment pipe operator %<>% of the magrittr package, which allows us to use data as input and then reassign the result back to the same object after we perform some functions on it.
STATE_FIPS %<>%
dplyr::rename( STATEFP = `State\n(FIPS)`,
STATE = Name) %>%
dplyr::select(STATEFP, STATE) %>%
dplyr::filter(STATEFP != "00")
STATE_FIPS
# A tibble: 51 x 2
STATEFP STATE
<chr> <chr>
1 09 Connecticut
2 23 Maine
3 25 Massachusetts
4 33 New Hampshire
5 44 Rhode Island
6 50 Vermont
7 34 New Jersey
8 36 New York
9 42 Pennsylvania
10 17 Illinois
# … with 41 more rows
Click here to see detailed information about how the demographic data was wrangled.
1977-1979
Now let’s take a look at our demographic data across the decades that we wish to study. If you have very wide data (meaning it has many columns), one way to view the data so that you can see all of the columns at the same time is to use the glimpse() function of the dplyr package.
Taking a look at the first decade of data, we can see that the Race/Sex Indicator variable contains two types of data: the race and the sex. This does not follow the tidy data philosophy, where each cell of a tibble should only contain one piece of information. Typically one might think of using the separate() function of the tidyr package to split this variable into two. However, one of the race values is Other races, and since this value also contains a space, separating this data is more tricky.
Instead we will use the str_extract() function of the stringr package and the mutate() function of the dplyr package. The mutate() function will allow us to create new variables, and the str_extract() function will allow us to match specific patterns and pull out matches to those patterns. Therefore, if the Race/Sex Indicator value is Other races male and we extract patterns matching either "male" or "female", which we can specify like this: pattern = "male|female", then the value will be male.
First we need to rename the Race/Sex Indicator variable to not have spaces so that it is compatible with the str_extract() function.
We also want to rename a couple of variables to be simpler and filter the data to only include the years we are interested in, as well as remove some variables that we don't need, like the FIPS State Code. We can remove variables by using the select() function with a minus sign (-) in front of the variable we wish to remove.
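The output below comes from a call like this (a sketch; the original chunk is not shown):
glimpse(dem_77_79)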
Rows: 3,060
Columns: 22
$ `Year of Estimate` <dbl> 1970, 1970, 1970, 1970, 1970, 1970, 1970, 1970, …
$ `FIPS State Code` <chr> "01", "01", "01", "01", "01", "01", "02", "02", …
$ `State Name` <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Ala…
$ `Race/Sex Indicator` <chr> "White male", "White female", "Black male", "Bla…
$ `Under 5 years` <dbl> 105856, 100613, 47403, 47079, 244, 250, 12382, 1…
$ `5 to 9 years` <dbl> 120876, 115194, 55443, 54851, 255, 251, 13888, 1…
$ `10 to 14 years` <dbl> 129091, 122352, 60427, 60065, 253, 245, 13255, 1…
$ `15 to 19 years` <dbl> 119500, 116107, 52921, 55144, 281, 254, 11179, 9…
$ `20 to 24 years` <dbl> 103665, 108513, 29948, 35165, 413, 331, 20237, 1…
$ `25 to 29 years` <dbl> 86538, 88359, 19535, 23662, 239, 302, 12538, 107…
$ `30 to 34 years` <dbl> 74452, 77595, 17196, 22021, 236, 284, 10331, 865…
$ `35 to 39 years` <dbl> 71511, 74941, 16654, 22248, 161, 279, 9548, 7510…
$ `40 to 44 years` <dbl> 75242, 78908, 17564, 24249, 127, 253, 8282, 6353…
$ `45 to 49 years` <dbl> 73874, 78589, 18186, 23028, 108, 148, 6995, 5820…
$ `50 to 54 years` <dbl> 68048, 72481, 17618, 22104, 95, 100, 5609, 4494,…
$ `55 to 59 years` <dbl> 61071, 67699, 18118, 21909, 88, 93, 4029, 2986, …
$ `60 to 64 years` <dbl> 52361, 61065, 16456, 20068, 69, 94, 2392, 1830, …
$ `65 to 69 years` <dbl> 38977, 49685, 14498, 19364, 54, 73, 1292, 965, 2…
$ `70 to 74 years` <dbl> 26767, 37227, 9541, 12509, 70, 66, 602, 496, 8, …
$ `75 to 79 years` <dbl> 17504, 27163, 6030, 8291, 31, 52, 326, 305, 1, 5…
$ `80 to 84 years` <dbl> 9937, 16470, 3485, 5031, 37, 30, 211, 186, 4, 5,…
$ `85 years and over` <dbl> 5616, 10445, 2448, 4035, 76, 29, 143, 126, 19, 4…
dem_77_79 <- dem_77_79 %>%
rename("race_sex" =`Race/Sex Indicator`) %>%
mutate(SEX = str_extract(race_sex, "male|female"),
RACE = str_extract(race_sex, "Black|White|Other"))%>%
select(-`FIPS State Code`, -`race_sex`) %>%
rename("YEAR" = `Year of Estimate`,
"STATE" = `State Name`) %>%
filter(YEAR %in% 1977:1979)
glimpse(dem_77_79)
Rows: 918
Columns: 22
$ YEAR <dbl> 1977, 1977, 1977, 1977, 1977, 1977, 1977, 1977, 1…
$ STATE <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alab…
$ `Under 5 years` <dbl> 98814, 94595, 46201, 45784, 590, 621, 14316, 1353…
$ `5 to 9 years` <dbl> 113365, 107395, 50097, 49329, 672, 660, 14621, 13…
$ `10 to 14 years` <dbl> 123107, 116182, 54925, 53955, 677, 653, 14795, 13…
$ `15 to 19 years` <dbl> 135343, 130433, 58468, 59926, 674, 605, 15207, 13…
$ `20 to 24 years` <dbl> 126053, 125352, 43898, 51433, 722, 773, 20106, 16…
$ `25 to 29 years` <dbl> 111547, 112471, 31014, 36648, 638, 835, 20444, 18…
$ `30 to 34 years` <dbl> 100674, 101543, 22528, 26694, 571, 766, 17514, 15…
$ `35 to 39 years` <dbl> 81038, 83369, 17473, 22213, 498, 586, 13098, 1069…
$ `40 to 44 years` <dbl> 75042, 77793, 16446, 22146, 356, 479, 10067, 7935…
$ `45 to 49 years` <dbl> 76296, 79753, 16578, 22576, 295, 432, 8460, 6848,…
$ `50 to 54 years` <dbl> 74844, 81079, 17117, 23028, 206, 326, 7268, 5914,…
$ `55 to 59 years` <dbl> 67785, 75905, 16437, 21435, 166, 213, 5398, 4485,…
$ `60 to 64 years` <dbl> 58853, 69406, 16276, 21075, 145, 174, 3349, 2708,…
$ `65 to 69 years` <dbl> 48848, 62430, 15837, 21126, 107, 173, 1714, 1468,…
$ `70 to 74 years` <dbl> 34475, 50075, 11450, 16028, 90, 138, 915, 928, 22…
$ `75 to 79 years` <dbl> 20977, 34027, 7601, 10825, 53, 106, 500, 493, 10,…
$ `80 to 84 years` <dbl> 10831, 21483, 3896, 6272, 25, 49, 237, 268, 4, 7,…
$ `85 years and over` <dbl> 6683, 15729, 2667, 5426, 33, 41, 153, 211, 11, 6,…
$ SEX <chr> "male", "female", "male", "female", "male", "fema…
$ RACE <chr> "White", "White", "Black", "Black", "Other", "Oth…
That’s looking pretty good! We also want to take all the age group variables and make one variable that is the age group name and one that is the population count for that age group. To do this we will use the pivot_longer() function of the tidyr package. To use this function, we need to use the cols argument to indicate which columns we want to pivot. We also name the new variables we will create with the names_to and values_to arguments. The names_to value will be the name of the variable that identifies each age group, and values_to will be the name of the variable that contains the corresponding population values.
dem_77_79 <- dem_77_79 %>%
pivot_longer(cols=contains("years"),
names_to = "AGE_GROUP",
values_to = "SUB_POP")
glimpse(dem_77_79)
Rows: 16,524
Columns: 6
$ YEAR <dbl> 1977, 1977, 1977, 1977, 1977, 1977, 1977, 1977, 1977, 1977,…
$ STATE <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "Ala…
$ SEX <chr> "male", "male", "male", "male", "male", "male", "male", "ma…
$ RACE <chr> "White", "White", "White", "White", "White", "White", "Whit…
$ AGE_GROUP <chr> "Under 5 years", "5 to 9 years", "10 to 14 years", "15 to 1…
$ SUB_POP <dbl> 98814, 113365, 123107, 135343, 126053, 111547, 100674, 8103…
We also want to get data about the total population for the state for each year.
To do so, we can sum all the values of the SUB_POP variable that we just created. To do this we can use the group_by() and summarise() functions of the dplyr package. The group_by() function specifies how we want to calculate our sum: we would like to calculate it for each year and each state individually. Thus, all the values that have the same STATE and YEAR values will be summed together, rather than summing all of the values in the SUB_POP variable. The .groups argument allows us to remove the grouping after we perform the calculation with summarise().
pop_77_79 <- dem_77_79 %>%
group_by(YEAR, STATE) %>%
summarise("TOT_POP" = sum(SUB_POP), .groups = "drop")
pop_77_79
# A tibble: 153 x 3
YEAR STATE TOT_POP
<dbl> <chr> <dbl>
1 1977 Alabama 3782571
2 1977 Alaska 397220
3 1977 Arizona 2427296
4 1977 Arkansas 2207195
5 1977 California 22350332
6 1977 Colorado 2696179
7 1977 Connecticut 3088745
8 1977 Delaware 594815
9 1977 District of Columbia 681766
10 1977 Florida 8888806
# … with 143 more rows
Now we will add the population value to the demographic tibble using the left_join() function of the dplyr package. It is important that we specify how this should be done: the YEAR and STATE variable values should match each other. This will place the dem_77_79 variables to the left of the pop_77_79 data.
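A sketch of the join that produces the output below:
dem_77_79 <- dem_77_79 %>%
  left_join(pop_77_79, by = c("YEAR", "STATE"))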
# A tibble: 16,524 x 7
YEAR STATE SEX RACE AGE_GROUP SUB_POP TOT_POP
<dbl> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 1977 Alabama male White Under 5 years 98814 3782571
2 1977 Alabama male White 5 to 9 years 113365 3782571
3 1977 Alabama male White 10 to 14 years 123107 3782571
4 1977 Alabama male White 15 to 19 years 135343 3782571
5 1977 Alabama male White 20 to 24 years 126053 3782571
6 1977 Alabama male White 25 to 29 years 111547 3782571
7 1977 Alabama male White 30 to 34 years 100674 3782571
8 1977 Alabama male White 35 to 39 years 81038 3782571
9 1977 Alabama male White 40 to 44 years 75042 3782571
10 1977 Alabama male White 45 to 49 years 76296 3782571
# … with 16,514 more rows
We will also calculate the percentage that each group makes up of the total population by dividing SUB_POP by TOT_POP and multiplying by 100 using the mutate() function. We will also remove the other population variables.
dem_77_79 %<>%
mutate(PERC_SUB_POP = (SUB_POP/TOT_POP)*100) %>%
select(-SUB_POP, -TOT_POP)
dem_77_79
# A tibble: 16,524 x 6
YEAR STATE SEX RACE AGE_GROUP PERC_SUB_POP
<dbl> <chr> <chr> <chr> <chr> <dbl>
1 1977 Alabama male White Under 5 years 2.61
2 1977 Alabama male White 5 to 9 years 3.00
3 1977 Alabama male White 10 to 14 years 3.25
4 1977 Alabama male White 15 to 19 years 3.58
5 1977 Alabama male White 20 to 24 years 3.33
6 1977 Alabama male White 25 to 29 years 2.95
7 1977 Alabama male White 30 to 34 years 2.66
8 1977 Alabama male White 35 to 39 years 2.14
9 1977 Alabama male White 40 to 44 years 1.98
10 1977 Alabama male White 45 to 49 years 2.02
# … with 16,514 more rows
It is important to make sure that we have the totals we would expect. We have two levels of SEX, three levels of RACE, three levels of YEAR, eighteen levels of AGE_GROUP, and fifty-one levels of STATE. If we multiply these together we get 16,524, which is the same as the number of rows in our final dem_77_79 data. Looks good!
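We can verify this count in code; a quick sketch:
2 * 3 * 3 * 18 * 51                      # SEX x RACE x YEAR x AGE_GROUP x STATE = 16524
nrow(dem_77_79) == 2 * 3 * 3 * 18 * 51   # TRUE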
Let’s also make the values of the SEX variable capitalized so that they match the values of the other variables, like RACE. This will help us keep consistent values across the different years as we wrangle the data for the other decades. To do so, we will use the str_to_title() function of the stringr package. We need to use the pull() function to get the values of SEX out of dem_77_79. Once we make them capitalized, they are then reassigned to the SEX variable.
dem_77_79 %<>%
mutate(SEX = str_to_title(pull(dem_77_79, SEX)))
# This can also be done like this:
dem_77_79 %<>%
mutate(SEX = str_to_title(pull(., SEX)))
1980-1989
For this decade each year is a separate tibble and they are combined as a list.
[1] "list"
So the first thing we need to do is combine the tibbles in the list together. We can do that using the bind_rows() function of dplyr, which appends the data together based on the presence of columns with the same name in the different tibbles. We will use the map_df() function of the purrr package to allow us to do this across each tibble in our list, as sketched below.
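A minimal sketch of this step (the original code chunk is not shown):
dem_80_89 <- map_df(dem_80_89, bind_rows)   # row-bind every tibble in the list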
Rows: 188,460
Columns: 21
$ `Year of Estimate` <dbl> 1980, 1980, 1980, 1980, 1980, 1980, 198…
$ `FIPS State and County Codes` <chr> "01001", "01001", "01001", "01001", "01…
$ `Race/Sex Indicator` <chr> "White male", "White female", "Black ma…
$ `Under 5 years` <dbl> 985, 831, 357, 346, 4, 7, 2422, 2346, 6…
$ `5 to 9 years` <dbl> 1096, 987, 427, 395, 9, 8, 2661, 2467, …
$ `10 to 14 years` <dbl> 1271, 1074, 395, 415, 4, 11, 2783, 2614…
$ `15 to 19 years` <dbl> 1308, 1259, 460, 429, 10, 5, 3049, 2841…
$ `20 to 24 years` <dbl> 972, 1006, 300, 380, 3, 3, 2423, 2428, …
$ `25 to 29 years` <dbl> 850, 912, 240, 235, 2, 11, 2372, 2475, …
$ `30 to 34 years` <dbl> 891, 983, 163, 196, 4, 10, 2410, 2400, …
$ `35 to 39 years` <dbl> 942, 1015, 120, 158, 3, 12, 2101, 2202,…
$ `40 to 44 years` <dbl> 854, 882, 133, 147, 2, 11, 1881, 1859, …
$ `45 to 49 years` <dbl> 828, 739, 107, 154, 4, 11, 1708, 1694, …
$ `50 to 54 years` <dbl> 631, 602, 113, 165, 1, 7, 1657, 1798, 2…
$ `55 to 59 years` <dbl> 524, 532, 113, 150, 1, 2, 1641, 1943, 1…
$ `60 to 64 years` <dbl> 428, 451, 126, 166, 0, 1, 1630, 1819, 1…
$ `65 to 69 years` <dbl> 358, 417, 128, 160, 1, 0, 1503, 1729, 1…
$ `70 to 74 years` <dbl> 242, 332, 87, 119, 0, 0, 1163, 1335, 16…
$ `75 to 79 years` <dbl> 123, 237, 70, 94, 0, 0, 671, 906, 87, 1…
$ `80 to 84 years` <dbl> 52, 137, 31, 57, 0, 0, 331, 527, 43, 67…
$ `85 years and over` <dbl> 39, 86, 13, 44, 0, 1, 187, 408, 27, 65,…
Great! Now our data is all together.
Now we will wrangle the data similarly to the previous decade.
dem_80_89 <- dem_80_89 %>%
rename("race_sex" =`Race/Sex Indicator`) %>%
mutate(SEX = str_extract(race_sex, "male|female"),
RACE = str_extract(race_sex, "Black|White|Other"))%>%
select( -`race_sex`) %>%
rename("YEAR" = `Year of Estimate`)
glimpse(dem_80_89)
Rows: 188,460
Columns: 22
$ YEAR <dbl> 1980, 1980, 1980, 1980, 1980, 1980, 198…
$ `FIPS State and County Codes` <chr> "01001", "01001", "01001", "01001", "01…
$ `Under 5 years` <dbl> 985, 831, 357, 346, 4, 7, 2422, 2346, 6…
$ `5 to 9 years` <dbl> 1096, 987, 427, 395, 9, 8, 2661, 2467, …
$ `10 to 14 years` <dbl> 1271, 1074, 395, 415, 4, 11, 2783, 2614…
$ `15 to 19 years` <dbl> 1308, 1259, 460, 429, 10, 5, 3049, 2841…
$ `20 to 24 years` <dbl> 972, 1006, 300, 380, 3, 3, 2423, 2428, …
$ `25 to 29 years` <dbl> 850, 912, 240, 235, 2, 11, 2372, 2475, …
$ `30 to 34 years` <dbl> 891, 983, 163, 196, 4, 10, 2410, 2400, …
$ `35 to 39 years` <dbl> 942, 1015, 120, 158, 3, 12, 2101, 2202,…
$ `40 to 44 years` <dbl> 854, 882, 133, 147, 2, 11, 1881, 1859, …
$ `45 to 49 years` <dbl> 828, 739, 107, 154, 4, 11, 1708, 1694, …
$ `50 to 54 years` <dbl> 631, 602, 113, 165, 1, 7, 1657, 1798, 2…
$ `55 to 59 years` <dbl> 524, 532, 113, 150, 1, 2, 1641, 1943, 1…
$ `60 to 64 years` <dbl> 428, 451, 126, 166, 0, 1, 1630, 1819, 1…
$ `65 to 69 years` <dbl> 358, 417, 128, 160, 1, 0, 1503, 1729, 1…
$ `70 to 74 years` <dbl> 242, 332, 87, 119, 0, 0, 1163, 1335, 16…
$ `75 to 79 years` <dbl> 123, 237, 70, 94, 0, 0, 671, 906, 87, 1…
$ `80 to 84 years` <dbl> 52, 137, 31, 57, 0, 0, 331, 527, 43, 67…
$ `85 years and over` <dbl> 39, 86, 13, 44, 0, 1, 187, 408, 27, 65,…
$ SEX <chr> "male", "female", "male", "female", "ma…
$ RACE <chr> "White", "White", "Black", "Black", "Ot…
Notice that this time the state information is based on the numeric FIPS value. We want only the first two digits, as the rest indicate the county. We can use the str_sub() function of the stringr package for this. We will specify that we want to start at the first position and end at the second. Just like with str_extract(), we need to rename this variable first so that it is compatible.
dem_80_89 %<>%
rename("STATEFP_temp" = "FIPS State and County Codes") %>%
mutate(STATEFP = str_sub(STATEFP_temp, start = 1, end = 2)) %>%
left_join(STATE_FIPS, by = "STATEFP") %>%
dplyr::select(-STATEFP)
glimpse(dem_80_89)
Rows: 188,460
Columns: 23
$ YEAR <dbl> 1980, 1980, 1980, 1980, 1980, 1980, 1980, 1980, 1…
$ STATEFP_temp <chr> "01001", "01001", "01001", "01001", "01001", "010…
$ `Under 5 years` <dbl> 985, 831, 357, 346, 4, 7, 2422, 2346, 672, 645, 3…
$ `5 to 9 years` <dbl> 1096, 987, 427, 395, 9, 8, 2661, 2467, 740, 680, …
$ `10 to 14 years` <dbl> 1271, 1074, 395, 415, 4, 11, 2783, 2614, 644, 670…
$ `15 to 19 years` <dbl> 1308, 1259, 460, 429, 10, 5, 3049, 2841, 711, 762…
$ `20 to 24 years` <dbl> 972, 1006, 300, 380, 3, 3, 2423, 2428, 516, 601, …
$ `25 to 29 years` <dbl> 850, 912, 240, 235, 2, 11, 2372, 2475, 414, 469, …
$ `30 to 34 years` <dbl> 891, 983, 163, 196, 4, 10, 2410, 2400, 303, 352, …
$ `35 to 39 years` <dbl> 942, 1015, 120, 158, 3, 12, 2101, 2202, 224, 260,…
$ `40 to 44 years` <dbl> 854, 882, 133, 147, 2, 11, 1881, 1859, 206, 288, …
$ `45 to 49 years` <dbl> 828, 739, 107, 154, 4, 11, 1708, 1694, 219, 236, …
$ `50 to 54 years` <dbl> 631, 602, 113, 165, 1, 7, 1657, 1798, 203, 261, 7…
$ `55 to 59 years` <dbl> 524, 532, 113, 150, 1, 2, 1641, 1943, 178, 219, 8…
$ `60 to 64 years` <dbl> 428, 451, 126, 166, 0, 1, 1630, 1819, 171, 209, 8…
$ `65 to 69 years` <dbl> 358, 417, 128, 160, 1, 0, 1503, 1729, 170, 232, 6…
$ `70 to 74 years` <dbl> 242, 332, 87, 119, 0, 0, 1163, 1335, 164, 182, 4,…
$ `75 to 79 years` <dbl> 123, 237, 70, 94, 0, 0, 671, 906, 87, 129, 3, 6, …
$ `80 to 84 years` <dbl> 52, 137, 31, 57, 0, 0, 331, 527, 43, 67, 1, 2, 56…
$ `85 years and over` <dbl> 39, 86, 13, 44, 0, 1, 187, 408, 27, 65, 1, 1, 30,…
$ SEX <chr> "male", "female", "male", "female", "male", "fema…
$ RACE <chr> "White", "White", "Black", "Black", "Other", "Oth…
$ STATE <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alab…
dem_80_89 %<>%
pivot_longer(cols=contains("years"),
names_to = "AGE_GROUP",
values_to = "SUB_POP_temp") %>%
group_by(YEAR, STATE, AGE_GROUP, SEX, RACE) %>%
summarise(SUB_POP = sum(SUB_POP_temp), .groups="drop")
dem_80_89
# A tibble: 55,080 x 6
YEAR STATE AGE_GROUP SEX RACE SUB_POP
<dbl> <chr> <chr> <chr> <chr> <dbl>
1 1980 Alabama 10 to 14 years female Black 50108
2 1980 Alabama 10 to 14 years female Other 805
3 1980 Alabama 10 to 14 years female White 109066
4 1980 Alabama 10 to 14 years male Black 50768
5 1980 Alabama 10 to 14 years male Other 826
6 1980 Alabama 10 to 14 years male White 115988
7 1980 Alabama 15 to 19 years female Black 58428
8 1980 Alabama 15 to 19 years female Other 743
9 1980 Alabama 15 to 19 years female White 126783
10 1980 Alabama 15 to 19 years male Black 56808
# … with 55,070 more rows
pop_80_89 <- dem_80_89 %>%
group_by(YEAR, STATE) %>%
summarise("TOT_POP" = sum(SUB_POP), .groups = "drop")
dem_80_89 <- dem_80_89 %>%
left_join(pop_80_89, by = c("YEAR","STATE")) %>%
mutate(PERC_SUB_POP = (SUB_POP/TOT_POP)*100) %>%
dplyr::select(-SUB_POP, -TOT_POP)
dem_80_89
# A tibble: 55,080 x 6
YEAR STATE AGE_GROUP SEX RACE PERC_SUB_POP
<dbl> <chr> <chr> <chr> <chr> <dbl>
1 1980 Alabama 10 to 14 years female Black 1.28
2 1980 Alabama 10 to 14 years female Other 0.0206
3 1980 Alabama 10 to 14 years female White 2.80
4 1980 Alabama 10 to 14 years male Black 1.30
5 1980 Alabama 10 to 14 years male Other 0.0212
6 1980 Alabama 10 to 14 years male White 2.97
7 1980 Alabama 15 to 19 years female Black 1.50
8 1980 Alabama 15 to 19 years female Other 0.0191
9 1980 Alabama 15 to 19 years female White 3.25
10 1980 Alabama 15 to 19 years male Black 1.46
# … with 55,070 more rows
Just like with the data from the 70s, we will also change the values of SEX to be capitalized, as sketched below.
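A sketch of that step, mirroring what we did for the 70s:
dem_80_89 %<>%
  mutate(SEX = str_to_title(SEX))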
Again, it is important to make sure that we have the totals we would expect. This time we have two levels of SEX, three levels of RACE, ten levels of YEAR, eighteen levels of AGE_GROUP, and fifty-one levels of STATE.
If we multiply these together we get 55,080, which is the same as the number of rows of the final dem_80_89 data. Looks good!
1990-1999
Just like with the 80s, we need to combine the data across the files:
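A sketch, assuming the same map_df() approach as the 80s:
dem_90_99 <- map_df(dem_90_99, bind_rows)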
Rows: 43,870
Columns: 19
$ Year <dbl> NA, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 19…
$ e <chr> NA, "01", "01", "01", "01", "01", "01", "01", "01", "01", "0…
$ Age <dbl> NA, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
$ Male <dbl> NA, 20406, 19393, 18990, 19246, 19502, 19560, 19091, 19605, …
$ Female <dbl> NA, 19101, 18114, 18043, 17786, 18366, 18386, 18047, 18316, …
$ Male_1 <dbl> NA, 9794, 9475, 9097, 9002, 9076, 9169, 8919, 9219, 9247, 10…
$ Female_1 <dbl> NA, 9414, 9247, 8837, 8701, 8989, 9093, 8736, 9192, 9108, 97…
$ Male_2 <dbl> NA, 103, 87, 97, 94, 108, 128, 160, 178, 166, 205, 194, 179,…
$ Female_2 <dbl> NA, 90, 93, 100, 115, 114, 130, 134, 162, 155, 193, 185, 202…
$ Male_3 <dbl> NA, 192, 146, 175, 150, 168, 170, 183, 171, 136, 177, 169, 1…
$ Female_3 <dbl> NA, 170, 182, 160, 157, 178, 158, 173, 177, 185, 179, 171, 1…
$ Male_4 <dbl> NA, 223, 190, 198, 186, 190, 210, 188, 178, 182, 221, 194, 1…
$ Female_4 <dbl> NA, 220, 196, 173, 191, 190, 170, 172, 179, 173, 166, 175, 1…
$ Male_5 <dbl> NA, 47, 41, 32, 35, 36, 30, 28, 27, 29, 32, 31, 33, 34, 32, …
$ Female_5 <dbl> NA, 45, 47, 41, 30, 26, 37, 23, 35, 31, 28, 38, 22, 39, 29, …
$ Male_6 <dbl> NA, 1, 2, 1, 9, 5, 8, 2, 4, 6, 6, 0, 1, 9, 6, 7, 5, 2, 2, 4,…
$ Female_6 <dbl> NA, 8, 0, 2, 1, 4, 5, 3, 4, 4, 3, 4, 2, 2, 7, 0, 2, 2, 1, 6,…
$ Male_7 <dbl> NA, 5, 7, 2, 3, 5, 11, 2, 7, 12, 10, 7, 5, 6, 5, 6, 6, 2, 11…
$ Female_7 <dbl> NA, 5, 5, 5, 3, 14, 6, 7, 6, 3, 11, 5, 5, 7, 8, 6, 6, 7, 3, …
For this decade the column names can’t all be imported in a simple way from the table, so they need to be recoded.
Here is what the data looks like before importing:
So first, using the base colnames() function, we change the column names.
colnames(dem_90_99) <- c("YEAR",
"STATEFP",
"Age",
"NH_W_M",
"NH_W_F",
"NH_B_M",
"NH_B_F",
"NH_AIAN_M",
"NH_AIAN_F",
"NH_API_M",
"NH_API_F",
"H_W_M",
"H_W_F",
"H_B_M",
"H_B_F",
"H_AIAN_M",
"H_AIAN_F",
"H_API_M",
"H_API_F")
glimpse(dem_90_99)
Rows: 43,870
Columns: 19
$ YEAR <dbl> NA, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1…
$ STATEFP <chr> NA, "01", "01", "01", "01", "01", "01", "01", "01", "01", "…
$ Age <dbl> NA, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
$ NH_W_M <dbl> NA, 20406, 19393, 18990, 19246, 19502, 19560, 19091, 19605,…
$ NH_W_F <dbl> NA, 19101, 18114, 18043, 17786, 18366, 18386, 18047, 18316,…
$ NH_B_M <dbl> NA, 9794, 9475, 9097, 9002, 9076, 9169, 8919, 9219, 9247, 1…
$ NH_B_F <dbl> NA, 9414, 9247, 8837, 8701, 8989, 9093, 8736, 9192, 9108, 9…
$ NH_AIAN_M <dbl> NA, 103, 87, 97, 94, 108, 128, 160, 178, 166, 205, 194, 179…
$ NH_AIAN_F <dbl> NA, 90, 93, 100, 115, 114, 130, 134, 162, 155, 193, 185, 20…
$ NH_API_M <dbl> NA, 192, 146, 175, 150, 168, 170, 183, 171, 136, 177, 169, …
$ NH_API_F <dbl> NA, 170, 182, 160, 157, 178, 158, 173, 177, 185, 179, 171, …
$ H_W_M <dbl> NA, 223, 190, 198, 186, 190, 210, 188, 178, 182, 221, 194, …
$ H_W_F <dbl> NA, 220, 196, 173, 191, 190, 170, 172, 179, 173, 166, 175, …
$ H_B_M <dbl> NA, 47, 41, 32, 35, 36, 30, 28, 27, 29, 32, 31, 33, 34, 32,…
$ H_B_F <dbl> NA, 45, 47, 41, 30, 26, 37, 23, 35, 31, 28, 38, 22, 39, 29,…
$ H_AIAN_M <dbl> NA, 1, 2, 1, 9, 5, 8, 2, 4, 6, 6, 0, 1, 9, 6, 7, 5, 2, 2, 4…
$ H_AIAN_F <dbl> NA, 8, 0, 2, 1, 4, 5, 3, 4, 4, 3, 4, 2, 2, 7, 0, 2, 2, 1, 6…
$ H_API_M <dbl> NA, 5, 7, 2, 3, 5, 11, 2, 7, 12, 10, 7, 5, 6, 5, 6, 6, 2, 1…
$ H_API_F <dbl> NA, 5, 5, 5, 3, 14, 6, 7, 6, 3, 11, 5, 5, 7, 8, 6, 6, 7, 3,…
Notice also that the first row is all NA values, coming from white space in the original table for 1990; this is probably true for each year. We can check the dimensions of our table using the base dim() function. When we filter for rows where YEAR is NA, we indeed see 10 rows, which is what we would expect if we have a row like this for each of the years in the decade. We see the same if we try a different variable. Now we will test how large our tibble is if we drop rows with NA values using the drop_na() function of tidyr. We see that indeed our dimensions only changed by ten, so there are no other rows with missing values that we might not expect. So now we will reassign the dem_90_99 variable after removing these rows.
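A sketch of these checks (the original code is not shown):
dim(dem_90_99)                         # 43870 rows before dropping
dem_90_99 %>% filter(is.na(YEAR))      # the 10 all-NA rows
dem_90_99 %>% filter(is.na(STATEFP))   # the same rows via a different variable
dem_90_99 %<>% drop_na()               # remove them and reassign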
[1] 43870 19
# A tibble: 10 x 19
YEAR STATEFP Age NH_W_M NH_W_F NH_B_M NH_B_F NH_AIAN_M NH_AIAN_F NH_API_M
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 NA <NA> NA NA NA NA NA NA NA NA
2 NA <NA> NA NA NA NA NA NA NA NA
3 NA <NA> NA NA NA NA NA NA NA NA
4 NA <NA> NA NA NA NA NA NA NA NA
5 NA <NA> NA NA NA NA NA NA NA NA
6 NA <NA> NA NA NA NA NA NA NA NA
7 NA <NA> NA NA NA NA NA NA NA NA
8 NA <NA> NA NA NA NA NA NA NA NA
9 NA <NA> NA NA NA NA NA NA NA NA
10 NA <NA> NA NA NA NA NA NA NA NA
# … with 9 more variables: NH_API_F <dbl>, H_W_M <dbl>, H_W_F <dbl>,
# H_B_M <dbl>, H_B_F <dbl>, H_AIAN_M <dbl>, H_AIAN_F <dbl>, H_API_M <dbl>,
# H_API_F <dbl>
# A tibble: 10 x 19
YEAR STATEFP Age NH_W_M NH_W_F NH_B_M NH_B_F NH_AIAN_M NH_AIAN_F NH_API_M
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 NA <NA> NA NA NA NA NA NA NA NA
2 NA <NA> NA NA NA NA NA NA NA NA
3 NA <NA> NA NA NA NA NA NA NA NA
4 NA <NA> NA NA NA NA NA NA NA NA
5 NA <NA> NA NA NA NA NA NA NA NA
6 NA <NA> NA NA NA NA NA NA NA NA
7 NA <NA> NA NA NA NA NA NA NA NA
8 NA <NA> NA NA NA NA NA NA NA NA
9 NA <NA> NA NA NA NA NA NA NA NA
10 NA <NA> NA NA NA NA NA NA NA NA
# … with 9 more variables: NH_API_F <dbl>, H_W_M <dbl>, H_W_F <dbl>,
# H_B_M <dbl>, H_B_F <dbl>, H_AIAN_M <dbl>, H_AIAN_F <dbl>, H_API_M <dbl>,
# H_API_F <dbl>
# A tibble: 43,860 x 19
YEAR STATEFP Age NH_W_M NH_W_F NH_B_M NH_B_F NH_AIAN_M NH_AIAN_F NH_API_M
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1990 01 0 20406 19101 9794 9414 103 90 192
2 1990 01 1 19393 18114 9475 9247 87 93 146
3 1990 01 2 18990 18043 9097 8837 97 100 175
4 1990 01 3 19246 17786 9002 8701 94 115 150
5 1990 01 4 19502 18366 9076 8989 108 114 168
6 1990 01 5 19560 18386 9169 9093 128 130 170
7 1990 01 6 19091 18047 8919 8736 160 134 183
8 1990 01 7 19605 18316 9219 9192 178 162 171
9 1990 01 8 18823 17743 9247 9108 166 155 136
10 1990 01 9 20226 19178 10194 9784 205 193 177
# … with 43,850 more rows, and 9 more variables: NH_API_F <dbl>, H_W_M <dbl>,
# H_W_F <dbl>, H_B_M <dbl>, H_B_F <dbl>, H_AIAN_M <dbl>, H_AIAN_F <dbl>,
# H_API_M <dbl>, H_API_F <dbl>
Then we sum across the non-Hispanic and Hispanic groups, because this information is not available for the previous decades. Then we will remove the variables for the Hispanic and non-Hispanic subgroups using select().
dem_90_99%<>%
mutate(W_M = NH_W_M + H_W_M,
W_F = NH_W_F + H_W_F,
B_M = NH_B_M + H_B_M,
B_F = NH_B_F + H_B_F,
AIAN_M = NH_AIAN_M + H_AIAN_M,
AIAN_F = NH_AIAN_F + H_AIAN_F,
API_M = NH_API_M + H_API_M,
API_F = NH_API_F + H_API_F) %>%
select(-starts_with("NH_"), -starts_with("H_"))
glimpse(dem_90_99)
Rows: 43,860
Columns: 11
$ YEAR <dbl> 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1…
$ STATEFP <chr> "01", "01", "01", "01", "01", "01", "01", "01", "01", "01", "…
$ Age <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
$ W_M <dbl> 20629, 19583, 19188, 19432, 19692, 19770, 19279, 19783, 19005…
$ W_F <dbl> 19321, 18310, 18216, 17977, 18556, 18556, 18219, 18495, 17916…
$ B_M <dbl> 9841, 9516, 9129, 9037, 9112, 9199, 8947, 9246, 9276, 10226, …
$ B_F <dbl> 9459, 9294, 8878, 8731, 9015, 9130, 8759, 9227, 9139, 9812, 1…
$ AIAN_M <dbl> 104, 89, 98, 103, 113, 136, 162, 182, 172, 211, 194, 180, 209…
$ AIAN_F <dbl> 98, 93, 102, 116, 118, 135, 137, 166, 159, 196, 189, 204, 198…
$ API_M <dbl> 197, 153, 177, 153, 173, 181, 185, 178, 148, 187, 176, 164, 1…
$ API_F <dbl> 175, 187, 165, 160, 192, 164, 180, 183, 188, 190, 176, 168, 1…
Looking better! We also need to add age groups like the other decades. We will take a look at the 80s data using the distinct() function of the dplyr package to see what age groups we need. We can use the base cut() function to create a new variable with mutate() called AGE_GROUP that will have a label for every 5-year span of age. The right = FALSE argument specifies that the interval is not closed on the right, meaning that if a value falls exactly on a cutpoint (for example, an Age value of 5), it will be placed in the 5 to 9 years group.
We can make the labels for the AGE_GROUP variable match those of dem_77_79 but we need to pull out the values of the tibble created by distinct(). To do this we can use the pull() function from the dplyr package. Note that it is important to check that the AGE_GROUP values are listed in order for dem_77_79. We will also remove the Age variable after we create the new AGE_GROUP variable for the dem_90_99 data.
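A sketch of the calls that produce the output below:
distinct(dem_77_79, AGE_GROUP)                    # the 18 age groups, in order
pull(distinct(dem_77_79, AGE_GROUP), AGE_GROUP)   # the labels as a character vector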
# A tibble: 18 x 1
AGE_GROUP
<chr>
1 Under 5 years
2 5 to 9 years
3 10 to 14 years
4 15 to 19 years
5 20 to 24 years
6 25 to 29 years
7 30 to 34 years
8 35 to 39 years
9 40 to 44 years
10 45 to 49 years
11 50 to 54 years
12 55 to 59 years
13 60 to 64 years
14 65 to 69 years
15 70 to 74 years
16 75 to 79 years
17 80 to 84 years
18 85 years and over
[1] "Under 5 years" "5 to 9 years" "10 to 14 years"
[4] "15 to 19 years" "20 to 24 years" "25 to 29 years"
[7] "30 to 34 years" "35 to 39 years" "40 to 44 years"
[10] "45 to 49 years" "50 to 54 years" "55 to 59 years"
[13] "60 to 64 years" "65 to 69 years" "70 to 74 years"
[16] "75 to 79 years" "80 to 84 years" "85 years and over"
dem_90_99 %<>%
mutate(AGE_GROUP = cut(Age,
breaks = seq(0,90, by=5),
right = FALSE, labels = pull(distinct(dem_77_79,AGE_GROUP), AGE_GROUP))) %>%
select(-Age)
glimpse(dem_90_99)
Rows: 43,860
Columns: 11
$ YEAR <dbl> 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990,…
$ STATEFP <chr> "01", "01", "01", "01", "01", "01", "01", "01", "01", "01",…
$ W_M <dbl> 20629, 19583, 19188, 19432, 19692, 19770, 19279, 19783, 190…
$ W_F <dbl> 19321, 18310, 18216, 17977, 18556, 18556, 18219, 18495, 179…
$ B_M <dbl> 9841, 9516, 9129, 9037, 9112, 9199, 8947, 9246, 9276, 10226…
$ B_F <dbl> 9459, 9294, 8878, 8731, 9015, 9130, 8759, 9227, 9139, 9812,…
$ AIAN_M <dbl> 104, 89, 98, 103, 113, 136, 162, 182, 172, 211, 194, 180, 2…
$ AIAN_F <dbl> 98, 93, 102, 116, 118, 135, 137, 166, 159, 196, 189, 204, 1…
$ API_M <dbl> 197, 153, 177, 153, 173, 181, 185, 178, 148, 187, 176, 164,…
$ API_F <dbl> 175, 187, 165, 160, 192, 164, 180, 183, 188, 190, 176, 168,…
$ AGE_GROUP <fct> Under 5 years, Under 5 years, Under 5 years, Under 5 years,…
Like the previous decades, we will create a RACE and SUB_POP variable, using pivot_longer() to create a single RACE variable out of all the subgroup variables.
Now we need to collapse the data for the various races so that it matches the previous decades. This time we will use the case_when() function of the dplyr package and the str_detect() function of the stringr package to identify when the race is something other than B or W and replace it with the value Other. The value to the right of the ~ indicates what we want the value of the new variable to be if the value of the variable we are using with str_detect() matches the specified condition. If the value does not match any specified condition, then the value will be whatever is listed after TRUE ~. We will then create population counts as we did previously for the other decades.
Finally, we will create new sums for the subpopulations, where we sum across the two Other subgroups of RACE to create a single value for each combination of YEAR, SEX, AGE_GROUP, and STATE, by using the group_by() function and summarise().
dem_90_99 %<>%
pivot_longer(cols = c(starts_with("W_"),
starts_with("B_"),
starts_with("AIAN_"),
starts_with("API_")),
names_to = "RACE",
values_to = "SUB_POP_temp")
dem_90_99 %<>%
mutate(SEX = case_when(str_detect(RACE, "_M") ~ "Male",
TRUE ~ "Female"),
RACE = case_when(str_detect(RACE, "W_") ~ "White",
str_detect(RACE, "B_") ~ "Black",
TRUE ~ "Other")) %>%
left_join(STATE_FIPS, by = "STATEFP") %>%
dplyr::select(-STATEFP)
dem_90_99 %<>%
group_by(YEAR, STATE, AGE_GROUP, SEX, RACE) %>%
summarise(SUB_POP = sum(SUB_POP_temp), .groups = "drop")
pop_90_99 <- dem_90_99 %>%
group_by(YEAR, STATE) %>%
summarise(TOT_POP = sum(SUB_POP), .groups = "drop")
dem_90_99 <- dem_90_99 %>%
left_join(pop_90_99, by=c("YEAR", "STATE")) %>%
mutate(PERC_SUB_POP = (SUB_POP/TOT_POP)*100) %>%
dplyr::select(-SUB_POP, -TOT_POP)
dem_90_99
# A tibble: 55,080 x 6
YEAR STATE AGE_GROUP SEX RACE PERC_SUB_POP
<dbl> <chr> <fct> <chr> <chr> <dbl>
1 1990 Alabama Under 5 years Female Black 1.12
2 1990 Alabama Under 5 years Female Other 0.0347
3 1990 Alabama Under 5 years Female White 2.28
4 1990 Alabama Under 5 years Male Black 1.15
5 1990 Alabama Under 5 years Male Other 0.0336
6 1990 Alabama Under 5 years Male White 2.43
7 1990 Alabama 5 to 9 years Female Black 1.14
8 1990 Alabama 5 to 9 years Female Other 0.0419
9 1990 Alabama 5 to 9 years Female White 2.29
10 1990 Alabama 5 to 9 years Male Black 1.16
# … with 55,070 more rows
Again, we should check to make sure that we have the total values we would expect. We have the same number of unique values for each of our variables as with the data from the 80s, so if we collapsed the data for the different additional subpopulations in this data correctly, we should get the same number of rows.
Indeed it looks like we have 55,080 rows, which is what we would expect and is the same as the number of rows of the final dem_80_89 data. Looks good!
2000-2010
Again, for this decade we need to combine the data across years.
Rows: 62,244
Columns: 21
$ REGION <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ DIVISION <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ STATE <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ NAME <chr> "United States", "United States", "United States", …
$ SEX <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ ORIGIN <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ RACE <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ AGEGRP <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1…
$ ESTIMATESBASE2000 <dbl> 281424600, 19176154, 20549855, 20528425, 20218782, …
$ POPESTIMATE2000 <dbl> 282162411, 19178293, 20463852, 20637696, 20294955, …
$ POPESTIMATE2001 <dbl> 284968955, 19298217, 20173362, 20978678, 20456284, …
$ POPESTIMATE2002 <dbl> 287625193, 19429192, 19872417, 21261421, 20610370, …
$ POPESTIMATE2003 <dbl> 290107933, 19592446, 19620851, 21415353, 20797166, …
$ POPESTIMATE2004 <dbl> 292805298, 19785885, 19454237, 21411680, 21102552, …
$ POPESTIMATE2005 <dbl> 295516599, 19917400, 19389067, 21212579, 21486214, …
$ POPESTIMATE2006 <dbl> 298379912, 19938883, 19544688, 21033138, 21807709, …
$ POPESTIMATE2007 <dbl> 301231207, 20125962, 19714611, 20841042, 22067816, …
$ POPESTIMATE2008 <dbl> 304093966, 20271127, 19929602, 20706655, 22210880, …
$ POPESTIMATE2009 <dbl> 306771529, 20244518, 20182499, 20660564, 22192810, …
$ CENSUS2010POP <dbl> 308745538, 20201362, 20348657, 20677194, 22040343, …
$ POPESTIMATE2010 <dbl> 309349689, 20200529, 20382409, 20694011, 21959087, …
OK, the data looks a bit different from the others. First we will remove a couple of variables that we probably don’t need. Also, it looks like we have some values for the entire United States, and we will drop these to be like the other decades.
We can see that there are lots of values that are zero. According to the technical documentation for this data, zero values indicate the total for the other categories of Sex, Origin, Race, and AGEGRP.
So we will drop the total values for SEX, RACE, and AGEGRP by removing the rows where these variables are equal to zero.
We will also want to select only the total values for ORIGIN, as we do not wish to divide the data into subgroups by Hispanic ethnicity because we do not have that information for the first two decades. Thus we will filter for only the rows where ORIGIN is equal to zero.
We will then remove the REGION, DIVISION, STATE, and ORIGIN variables. We will then rename NAME to be STATE and rename AGEGRP to be AGE_GROUP like the other decades.
dem_00_10 %<>%
filter(SEX != 0,
RACE != 0,
AGEGRP != 0,
ORIGIN == 0) %>%
dplyr::select(-REGION, -DIVISION, -ORIGIN, -STATE) %>%
rename("STATE" = NAME,
"AGE_GROUP" = AGEGRP)
dem_00_10
# A tibble: 11,016 x 15
STATE SEX RACE AGE_GROUP POPESTIMATE2000 POPESTIMATE2001 POPESTIMATE2002
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Alab… 1 1 1 99527 99985 99578
2 Alab… 1 1 2 104423 102518 101023
3 Alab… 1 1 3 108325 108412 108059
4 Alab… 1 1 4 108638 107370 107337
5 Alab… 1 1 5 104337 107230 108195
6 Alab… 1 1 6 106491 101466 98949
7 Alab… 1 1 7 110116 110630 110416
8 Alab… 1 1 8 123719 120283 116502
9 Alab… 1 1 9 124961 125443 124751
10 Alab… 1 1 10 115024 117010 119354
# … with 11,006 more rows, and 8 more variables: POPESTIMATE2003 <dbl>,
# POPESTIMATE2004 <dbl>, POPESTIMATE2005 <dbl>, POPESTIMATE2006 <dbl>,
# POPESTIMATE2007 <dbl>, POPESTIMATE2008 <dbl>, POPESTIMATE2009 <dbl>,
# POPESTIMATE2010 <dbl>
Now we need to recode the numeric values to the values in the technical documentation. We can do so by adding labels to each numeric level using the base function factor().
dem_00_10 %<>%
mutate(SEX = factor(SEX,
levels = 1:2,
labels = c("Male",
"Female")),
RACE = factor(RACE,
levels = 1:6,
labels = c("White",
"Black",
rep("Other",4))),
AGE_GROUP = factor(AGE_GROUP,
levels = 1:18,
labels = pull(distinct(dem_77_79,AGE_GROUP), AGE_GROUP)))
glimpse(dem_00_10)
Rows: 11,016
Columns: 15
$ STATE <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama"…
$ SEX <fct> Male, Male, Male, Male, Male, Male, Male, Male, Male,…
$ RACE <fct> White, White, White, White, White, White, White, Whit…
$ AGE_GROUP <fct> Under 5 years, 5 to 9 years, 10 to 14 years, 15 to 19…
$ POPESTIMATE2000 <dbl> 99527, 104423, 108325, 108638, 104337, 106491, 110116…
$ POPESTIMATE2001 <dbl> 99985, 102518, 108412, 107370, 107230, 101466, 110630…
$ POPESTIMATE2002 <dbl> 99578, 101023, 108059, 107337, 108195, 98949, 110416,…
$ POPESTIMATE2003 <dbl> 99627, 99920, 108026, 107749, 109360, 98276, 109893, …
$ POPESTIMATE2004 <dbl> 99788, 99306, 107627, 108666, 109037, 98742, 107653, …
$ POPESTIMATE2005 <dbl> 100316, 99754, 106570, 110278, 108727, 100327, 105151…
$ POPESTIMATE2006 <dbl> 100820, 101251, 106228, 111640, 108847, 103869, 10161…
$ POPESTIMATE2007 <dbl> 101766, 101985, 106243, 112353, 109496, 105175, 99917…
$ POPESTIMATE2008 <dbl> 102304, 102479, 106155, 113305, 110007, 106348, 99921…
$ POPESTIMATE2009 <dbl> 101411, 102688, 106130, 113741, 111167, 106497, 10138…
$ POPESTIMATE2010 <dbl> 99480, 102939, 106324, 112272, 112423, 106593, 102923…
OK, we also want to change the shape of the data so that we have a YEAR variable and each estimate of the population is a value in a new variable called SUB_POP_temp.
dem_00_10 %<>%
pivot_longer(cols=contains("ESTIMATE"),
names_to = "YEAR",
values_to = "SUB_POP_temp")We will now clean up the YEAR variable to only be the numeric value by keeping only the last 4 values of each string using the str_sub() function of the stringr package.
Now we will collapse the data for the different RACES and calculate a new SUB_POP value.
dem_00_10 %<>%
group_by(YEAR, AGE_GROUP, STATE, SEX, RACE) %>%
summarise(SUB_POP = sum(SUB_POP_temp), .groups = "drop")
Again, the dimensions look as we expect with 60,588 rows. This time we have two levels of SEX, three levels of RACE, 11 levels of YEAR, 18 levels of AGE_GROUP, and 51 levels of STATE. If we multiply these together we get 2 x 3 x 11 x 18 x 51 = 60,588. Looks good!
Now we will calculate the total population and percent of the total as we have done with the previous decades.
pop_00_10 <- dem_00_10 %>%
group_by(YEAR, STATE) %>%
summarise(TOT_POP = sum(SUB_POP), .groups = "drop")
We can also check that our wrangling was performed correctly by summing the individual subpopulation percentages and seeing if they total to 100.
dem_00_10 %>%
left_join(pop_00_10, by=c("YEAR", "STATE")) %>%
group_by(YEAR, STATE) %>%
mutate(PERC_SUB_POP = (SUB_POP/TOT_POP)*100) %>%
summarise(perc_tot = sum(PERC_SUB_POP), .groups = "drop") %>%
mutate(poss_error = case_when(abs(perc_tot - 100) > 0 ~ TRUE,
TRUE ~ FALSE)) %>%
group_by(poss_error) %>%
tally()
# A tibble: 1 x 2
poss_error n
<lgl> <int>
1 FALSE 561
Looks like the percentages for each state for each year all add up to 100, as we would expect. Great! Now we will reassign the dem_00_10 data with this processing.
dem_00_10 %<>%
left_join(pop_00_10, by = c("YEAR", "STATE")) %>%
mutate(PERC_SUB_POP = (SUB_POP/TOT_POP)*100) %>%
select(-SUB_POP, -TOT_POP)
dem_00_10
# A tibble: 60,588 x 6
YEAR AGE_GROUP STATE SEX RACE PERC_SUB_POP
<dbl> <fct> <chr> <fct> <fct> <dbl>
1 2000 Under 5 years Alabama Male White 2.24
2 2000 Under 5 years Alabama Male Black 1.05
3 2000 Under 5 years Alabama Male Other 0.101
4 2000 Under 5 years Alabama Female White 2.12
5 2000 Under 5 years Alabama Female Black 1.03
6 2000 Under 5 years Alabama Female Other 0.0995
7 2000 Under 5 years Alaska Male White 2.35
8 2000 Under 5 years Alaska Male Black 0.165
9 2000 Under 5 years Alaska Male Other 1.37
10 2000 Under 5 years Alaska Female White 2.26
# … with 60,578 more rows
OK, now we are ready to combine all of our demographic data together!
We can check that the colnames are the same for the data for each of the decades by using the setequal() function of the dplyr package.
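These hidden checks were presumably along these lines (a sketch, comparing each decade's column names against dem_77_79):
setequal(colnames(dem_77_79), colnames(dem_80_89))
setequal(colnames(dem_77_79), colnames(dem_90_99))
setequal(colnames(dem_77_79), colnames(dem_00_10))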
[1] TRUE
[1] TRUE
[1] TRUE
We can also confirm that we have the same number of age groups for each decade by using the base length() function. If you did not take a look at the wrangling for the demographic data then you may be unfamiliar with the pull() function of the dplyr package. This allows you to grab the values of a variable from a tibble. The distinct() function which is also of the dplyr package creates a tibble of the unique values for a variable.
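The four outputs below were presumably produced like this (a sketch):
length(pull(distinct(dem_77_79, AGE_GROUP), AGE_GROUP))
length(pull(distinct(dem_80_89, AGE_GROUP), AGE_GROUP))
length(pull(distinct(dem_90_99, AGE_GROUP), AGE_GROUP))
length(pull(distinct(dem_00_10, AGE_GROUP), AGE_GROUP))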
[1] 18
[1] 18
[1] 18
[1] 18
Looks good!
Now we will combine the data using the bind_rows() function of the dplyr package. This function appends the data together based on the presence of columns with the same name in the different tibbles.
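The combining chunk is hidden in the rendered page; it was presumably (a sketch, with dem as the name of the combined tibble, which is the name used later on):
dem <- bind_rows(dem_77_79, dem_80_89, dem_90_99, dem_00_10)
glimpse(dem)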
Rows: 187,272
Columns: 6
$ YEAR <dbl> 1977, 1977, 1977, 1977, 1977, 1977, 1977, 1977, 1977, 19…
$ STATE <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "…
$ SEX <chr> "Male", "Male", "Male", "Male", "Male", "Male", "Male", …
$ RACE <chr> "White", "White", "White", "White", "White", "White", "W…
$ AGE_GROUP <chr> "Under 5 years", "5 to 9 years", "10 to 14 years", "15 t…
$ PERC_SUB_POP <dbl> 2.6123502, 2.9970356, 3.2545853, 3.5780690, 3.3324688, 2…
Great! Now we have a really large single tibble.
Now we want to select similar demographic data to what was used in the previous analyses.
Here is the table from the Donohue paper that compares the data used in the analyses.
We can see that only the percentage of males that were from age 15-39 of the race groups (black, white, and other) were used in the Donohue analysis.
Ultimately we intend to make a tibble of data that is similar to each analysis. Therefore, we will now create a tibble of the demographic data for each analysis.
To do so we will first create a vector of the age groups that should be included in the Donohue-like analysis, which we will call DONOHUE_AGE_GROUPS. We will then filter for only the age groups in this vector by using the filter() function of the dplyr package and the %in% operator to indicate that we want to keep all AGE_GROUP values that are equal to those within DONOHUE_AGE_GROUPS. We also want to filter for only population percentages for males by using the == operator. Then we can collapse the age groups from 20-39 by using the fct_collapse() function of the forcats package.
DONOHUE_AGE_GROUPS <- c("15 to 19 years",
"20 to 24 years",
"25 to 29 years",
"30 to 34 years",
"35 to 39 years")
dem_DONOHUE <- dem %>%
filter(AGE_GROUP %in% DONOHUE_AGE_GROUPS,
SEX == "Male") %>%
mutate(AGE_GROUP = fct_collapse(AGE_GROUP, "20 to 39 years"=c("20 to 24 years",
"25 to 29 years",
"30 to 34 years",
"35 to 39 years")))
dem_DONOHUE
# A tibble: 26,010 x 6
YEAR STATE SEX RACE AGE_GROUP PERC_SUB_POP
<dbl> <chr> <chr> <chr> <fct> <dbl>
1 1977 Alabama Male White 15 to 19 years 3.58
2 1977 Alabama Male White 20 to 39 years 3.33
3 1977 Alabama Male White 20 to 39 years 2.95
4 1977 Alabama Male White 20 to 39 years 2.66
5 1977 Alabama Male White 20 to 39 years 2.14
6 1977 Alabama Male Black 15 to 19 years 1.55
7 1977 Alabama Male Black 20 to 39 years 1.16
8 1977 Alabama Male Black 20 to 39 years 0.820
9 1977 Alabama Male Black 20 to 39 years 0.596
10 1977 Alabama Male Black 20 to 39 years 0.462
# … with 26,000 more rows
We also want to create a new variable that will contain all the demographic information for each percentage just as was done in the Donohue, et al. analysis. This should result in 6 different demographic variables.
To do this we will modify the AGE_GROUP variable by using the mutate() function of the dplyr package. We will replace the spaces in the now two age group categories with an underscore using the str_replace_all() function of the stringr package, which replaces all instances of a pattern in a character string.
Then we will use the group_by() function and the summarise() function, also of the dplyr package, to calculate a sum of the percentages for each subpopulation for the newly modified age groups in AGE_GROUP. The .groups = "drop" argument removes the grouping after the summarise() function.
dem_DONOHUE %<>%
mutate(AGE_GROUP = str_replace_all(string = AGE_GROUP,
pattern = " ",
replacement = "_")) %>%
group_by(YEAR, STATE, RACE, SEX, AGE_GROUP) %>%
summarise(PERC_SUB_POP = sum(PERC_SUB_POP), .groups = "drop")
dem_DONOHUE
# A tibble: 10,404 x 6
YEAR STATE RACE SEX AGE_GROUP PERC_SUB_POP
<dbl> <chr> <chr> <chr> <chr> <dbl>
1 1977 Alabama Black Male 15_to_19_years 1.55
2 1977 Alabama Black Male 20_to_39_years 3.04
3 1977 Alabama Other Male 15_to_19_years 0.0178
4 1977 Alabama Other Male 20_to_39_years 0.0642
5 1977 Alabama White Male 15_to_19_years 3.58
6 1977 Alabama White Male 20_to_39_years 11.1
7 1977 Alaska Black Male 15_to_19_years 0.163
8 1977 Alaska Black Male 20_to_39_years 0.968
9 1977 Alaska Other Male 15_to_19_years 1.12
10 1977 Alaska Other Male 20_to_39_years 2.73
# … with 10,394 more rows
Now we will combine the variables RACE, SEX, and AGE_GROUP together into one string separated by underscores using the unite() function of the tidyr package. We will call this new variable VARIABLE. We will rename the PERC_SUB_POP variable to be VALUE using the rename() function of the dplyr package. The new name should be listed first, before the =.
dem_DONOHUE %<>%
unite(col = "VARIABLE", RACE, SEX, AGE_GROUP, sep = "_") %>%
rename("VALUE" = PERC_SUB_POP)
dem_DONOHUE
# A tibble: 10,404 x 4
YEAR STATE VARIABLE VALUE
<dbl> <chr> <chr> <dbl>
1 1977 Alabama Black_Male_15_to_19_years 1.55
2 1977 Alabama Black_Male_20_to_39_years 3.04
3 1977 Alabama Other_Male_15_to_19_years 0.0178
4 1977 Alabama Other_Male_20_to_39_years 0.0642
5 1977 Alabama White_Male_15_to_19_years 3.58
6 1977 Alabama White_Male_20_to_39_years 11.1
7 1977 Alaska Black_Male_15_to_19_years 0.163
8 1977 Alaska Black_Male_20_to_39_years 0.968
9 1977 Alaska Other_Male_15_to_19_years 1.12
10 1977 Alaska Other_Male_20_to_39_years 2.73
# … with 10,394 more rows
Let’s do a quick row number check. We have six different demographic variables, 51 states (DC counts as a state in this case), and 34 different years from 1977 to 2010, so we should have 6 x 51 x 34 = 10,404 rows, which we do!
Now, let’s do the same for the “Lott-like” analysis.
So, in this analysis there were 36 variables covering percentages of individuals from age 10 to over 65, three race groups, and both males and females. This table is misprinted and does not include the word “Other” for the third race group that was used.
First we will filter out the age groups that were not included. Then we will collapse the age groups to those that were used by Lott et al., again using the fct_collapse() function of the forcats package.
Also we will again combine the values across the variables to create a new demographic variable with 36 levels.
LOTT_AGE_GROUPS_NULL <- c("Under 5 years",
"5 to 9 years")
dem_LOTT <- dem %>%
filter(!(AGE_GROUP %in% LOTT_AGE_GROUPS_NULL) )%>%
mutate(AGE_GROUP = fct_collapse(AGE_GROUP,
"10 to 19 years"=c("10 to 14 years",
"15 to 19 years"),
"20 to 29 years"=c("20 to 24 years",
"25 to 29 years"),
"30 to 39 years"=c("30 to 34 years",
"35 to 39 years"),
"40 to 49 years"=c("40 to 44 years",
"45 to 49 years"),
"50 to 64 years"=c("50 to 54 years",
"55 to 59 years",
"60 to 64 years"),
"65 years and over"=c("65 to 69 years",
"70 to 74 years",
"75 to 79 years",
"80 to 84 years",
"85 years and over"))) %>%
mutate(AGE_GROUP = str_replace_all(AGE_GROUP," ","_")) %>%
group_by(YEAR, STATE, RACE, SEX, AGE_GROUP) %>%
summarise(PERC_SUB_POP = sum(PERC_SUB_POP), .groups = "drop") %>%
unite(col = "VARIABLE", RACE, SEX, AGE_GROUP, sep = "_") %>%
rename("VALUE"=PERC_SUB_POP)We can indeed check that we have the correct number of levels for VARIABLE using the distinct() function.
# A tibble: 36 x 1
VARIABLE
<chr>
1 Black_Female_10_to_19_years
2 Black_Female_20_to_29_years
3 Black_Female_30_to_39_years
4 Black_Female_40_to_49_years
5 Black_Female_50_to_64_years
6 Black_Female_65_years_and_over
7 Black_Male_10_to_19_years
8 Black_Male_20_to_29_years
9 Black_Male_30_to_39_years
10 Black_Male_40_to_49_years
# … with 26 more rows
We also have population data for each decade that came from wrangling the demographic data.
We again want to combine this data, so let’s again make sure that all the different tibbles have the same column names.
[1] TRUE
[1] TRUE
[1] TRUE
# A tibble: 6 x 3
YEAR STATE TOT_POP
<dbl> <chr> <dbl>
1 1977 Alabama 3782571
2 1977 Alaska 397220
3 1977 Arizona 2427296
4 1977 Arkansas 2207195
5 1977 California 22350332
6 1977 Colorado 2696179
# A tibble: 6 x 3
YEAR STATE TOT_POP
<dbl> <chr> <dbl>
1 1980 Alabama 3899671
2 1980 Alaska 404680
3 1980 Arizona 2735840
4 1980 Arkansas 2288809
5 1980 California 23792840
6 1980 Colorado 2909545
# A tibble: 6 x 3
YEAR STATE TOT_POP
<dbl> <chr> <dbl>
1 1990 Alabama 4048508
2 1990 Alaska 553120
3 1990 Arizona 3679056
4 1990 Arkansas 2354343
5 1990 California 29950111
6 1990 Colorado 3303862
# A tibble: 6 x 3
YEAR STATE TOT_POP
<dbl> <chr> <dbl>
1 2000 Alabama 4452173
2 2000 Alaska 627963
3 2000 Arizona 5160586
4 2000 Arkansas 2678588
5 2000 California 33987977
6 2000 Colorado 4326921
Looks good!
population_data <- bind_rows(pop_77_79,
pop_80_89,
pop_90_99,
pop_00_10)
population_data <- population_data %>%
mutate(VARIABLE = "Population") %>%
rename("VALUE" = TOT_POP)We could check that we have 51 values for each year by using the count() function of the dplyr package.
# A tibble: 34 x 2
YEAR n
<dbl> <int>
1 1977 51
2 1978 51
3 1979 51
4 1980 51
5 1981 51
6 1982 51
7 1983 51
8 1984 51
9 1985 51
10 1986 51
# … with 24 more rows
Click here to see details about how the police staffing data was wrangled.
OK, now we will wrangle the police staffing data. We want to limit the data to only the years of interest. Then we will also replace NA values with zero for the male_total_ct and female_total_ct variables using the replace_na() function of the tidyr package. We will also use the across() function of the dplyr package to select and mutate both of these columns in this way. Since both of these variables have total_ct in the name and no other variables do, we can use the contains() function of the dplyr package to specify that we want to use these columns instead of listing both out.
Rows: 1,439,467
Columns: 21
$ data_year <dbl> 1960, 1960, 1960, 1960, 1960, 1960, 1960, 1960,…
$ ori <chr> "AK020045Y", "AL0011000", "AL0160600", "AL01900…
$ pub_agency_name <chr> "Alcohol Beverage Control Board", "Homewood", "…
$ pub_agency_unit <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ state_abbr <chr> "AK", "AL", "AL", "AL", "AL", "AL", "AL", "AL",…
$ division_name <chr> "Pacific", "East South Central", "East South Ce…
$ region_name <chr> "West", "South", "South", "South", "South", "So…
$ county_name <chr> "N/A", "JEFFERSON", "N/A", "COFFEE", "N/A", "HA…
$ agency_type_name <chr> "Other State Agency", "City", "City", "County",…
$ population_group_desc <chr> "Cities under 2,500", "Cities from 10,000 thru …
$ population <dbl> 0, 20289, 0, 14852, 0, 3081, 0, 18739, 2776, 0,…
$ male_officer_ct <dbl> NA, 17, NA, 0, NA, 0, NA, 0, 0, NA, NA, 0, NA, …
$ male_civilian_ct <dbl> NA, 3, NA, 0, NA, 0, NA, 0, 0, NA, NA, 0, NA, N…
$ male_total_ct <dbl> NA, 20, NA, 0, NA, 0, NA, 0, 0, NA, NA, 0, NA, …
$ female_officer_ct <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ female_civilian_ct <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ female_total_ct <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ officer_ct <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ civilian_ct <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ total_pe_ct <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ pe_ct_per_1000 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
ps_data %<>%
filter(data_year >= 1977,
data_year <= 2014) %>%
mutate(across(.cols =contains("total_ct"), ~replace_na(., 0)))
glimpse(ps_data)
Rows: 932,063
Columns: 21
$ data_year <dbl> 1977, 1977, 1977, 1977, 1977, 1977, 1977, 1977,…
$ ori <chr> "AK0012000", "AK0012300", "AKASP0000", "AL00125…
$ pub_agency_name <chr> "Soldotna", "Kenai", "Alaska State Troopers", "…
$ pub_agency_unit <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ state_abbr <chr> "AK", "AK", "AK", "AL", "AL", "AL", "AL", "AL",…
$ division_name <chr> "Pacific", "Pacific", "Pacific", "East South Ce…
$ region_name <chr> "West", "West", "West", "South", "South", "Sout…
$ county_name <chr> "KENAI PENINSULA", "KENAI PENINSULA", "N/A", "J…
$ agency_type_name <chr> "City", "City", "State Police", "City", "Other"…
$ population_group_desc <chr> "Cities under 2,500", "Cities from 2,500 thru 9…
$ population <dbl> 2131, 5800, 172397, 1000, 0, 8611, 2850, 975, 2…
$ male_officer_ct <dbl> 6, 10, 0, 3, NA, 16, 6, 2, 5, NA, NA, 3, 26, 5,…
$ male_civilian_ct <dbl> 6, 0, 0, 0, NA, 5, 0, 0, 0, NA, NA, 0, 0, 0, 4,…
$ male_total_ct <dbl> 12, 10, 0, 3, 0, 21, 6, 2, 5, 0, 0, 3, 26, 5, 6…
$ female_officer_ct <lgl> FALSE, FALSE, NA, FALSE, NA, FALSE, FALSE, FALS…
$ female_civilian_ct <lgl> TRUE, NA, NA, FALSE, NA, FALSE, FALSE, FALSE, F…
$ female_total_ct <dbl> 1, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 0, 13, 0…
$ officer_ct <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ civilian_ct <lgl> NA, NA, NA, FALSE, NA, NA, FALSE, FALSE, FALSE,…
$ total_pe_ct <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ pe_ct_per_1000 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
Now we can create a new variable called officer_total which will be the sum of these variables. We will then keep just this variable as well as the data_year, pub_agency_name, and state_abbr.
ps_data %<>%
mutate(officer_total = male_total_ct + female_total_ct) %>%
dplyr::select(data_year,
pub_agency_name,
state_abbr,
officer_total)
ps_data
# A tibble: 932,063 x 4
data_year pub_agency_name state_abbr officer_total
<dbl> <chr> <chr> <dbl>
1 1977 Soldotna AK 13
2 1977 Kenai AK 15
3 1977 Alaska State Troopers AK 0
4 1977 Trafford AL 3
5 1977 Trussville Fire Department Fire and Explo… AL 0
6 1977 Atmore AL 21
7 1977 East Brewton AL 6
8 1977 Brilliant AL 2
9 1977 Camden AL 5
10 1977 Drug Enforcement Administration, Birmingh… AL 0
# … with 932,053 more rows
Now we also want to collapse across pub_agency_name to get a total count for each year and each state. We will do this by using the group_by() function, grouping by data_year and state_abbr, and using the summarise() function to calculate a sum.
ps_data %<>%
group_by(data_year, state_abbr) %>%
summarise(officer_state_total=sum(officer_total), .groups = "drop")
ps_data
# A tibble: 2,242 x 3
data_year state_abbr officer_state_total
<dbl> <chr> <dbl>
1 1977 AK 544
2 1977 AL 7380
3 1977 AR 3344
4 1977 AS 0
5 1977 AZ 6414
6 1977 CA 65596
7 1977 CO 7337
8 1977 CT 6051
9 1977 CZ 0
10 1977 DC 4751
# … with 2,232 more rows
And we will check that we have the same number of values (the number of years included in the data) for each state.
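The hidden chunk was presumably along these lines; the [1] 0 2 output below is consistent with taking dim() of a filter for states whose count differs from 38 (a sketch, not the confirmed code):
ps_data %>%
  count(state_abbr) %>%
  head()
# A 0-row, 2-column result would mean no state deviates from 38 values
ps_data %>%
  count(state_abbr) %>%
  filter(n != 38) %>%
  dim()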
# A tibble: 6 x 2
state_abbr n
<chr> <int>
1 AK 38
2 AL 38
3 AR 38
4 AS 38
5 AZ 38
6 CA 38
[1] 0 2
Looks like all the states have 38 values.
Notice also that there are some unusual abbreviations in the state_abbr variable.
We will remove data for US territories and associated states.
| Abbreviation | Territory and associated states |
|---|---|
| AS | American Samoa |
| GM | Guam |
| CZ | Canal Zone |
| FS | ??Federated States of Micronesia (usually FM) |
| MP | Northern Mariana Islands |
| OT | ??U.S. Minor Outlying Islands (usually UM) |
| PR | Puerto Rico |
| VI | Virgin Islands |
state_of_interest_NULL <- c("AS",
"GM",
"CZ",
"FS",
"MP",
"OT",
"PR",
"VI")
ps_data <- ps_data %>%
filter(!(state_abbr %in% state_of_interest_NULL))
Within the datasets package that is loaded with R, there is a dataset called state that contains an object called state.abb that has the state abbreviations and state.name that has the state names. We will combine these now to add the state names to our data.
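The state_abb_data tibble shown below was presumably built along these lines (a sketch):
state_abb_data <- tibble(state_abbr = state.abb,
                         STATE = state.name)
head(state_abb_data)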
# A tibble: 6 x 2
state_abbr STATE
<chr> <chr>
1 AL Alabama
2 AK Alaska
3 AZ Arizona
4 AR Arkansas
5 CA California
6 CO Colorado
One unusual thing about this data is that NB is used for Nebraska rather than the usual NE (NB was Nebraska's earlier abbreviation, changed to NE to avoid confusion with New Brunswick in Canada). So we want to replace NE with NB in our abbreviation table using the str_replace() function of the stringr package.
state_abb_data %<>%
mutate(state_abbr = str_replace(string = state_abbr,
pattern = "NE",
replacement = "NB"))# A tibble: 50 x 2
state_abbr STATE
<chr> <chr>
1 AL Alabama
2 AK Alaska
3 AZ Arizona
4 AR Arkansas
5 CA California
6 CO Colorado
7 CT Connecticut
8 DE Delaware
9 FL Florida
10 GA Georgia
11 HI Hawaii
12 ID Idaho
13 IL Illinois
14 IN Indiana
15 IA Iowa
16 KS Kansas
17 KY Kentucky
18 LA Louisiana
19 ME Maine
20 MD Maryland
21 MA Massachusetts
22 MI Michigan
23 MN Minnesota
24 MS Mississippi
25 MO Missouri
26 MT Montana
27 NB Nebraska
28 NV Nevada
29 NH New Hampshire
30 NJ New Jersey
31 NM New Mexico
32 NY New York
33 NC North Carolina
34 ND North Dakota
35 OH Ohio
36 OK Oklahoma
37 OR Oregon
38 PA Pennsylvania
39 RI Rhode Island
40 SC South Carolina
41 SD South Dakota
42 TN Tennessee
43 TX Texas
44 UT Utah
45 VT Vermont
46 VA Virginia
47 WA Washington
48 WV West Virginia
49 WI Wisconsin
50 WY Wyoming
We need to add DC to this. We will use the add_row() function of the tibble package to do this. We just need to specify values for both of the variables.
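A sketch of this step (the chunk is hidden in the rendered page):
state_abb_data %<>%
  add_row(state_abbr = "DC", STATE = "District of Columbia")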
Now we will add this to our police staffing data and then remove the state_abbr variable, so that we just have state names.
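The join was presumably along these lines (a sketch; a left_join() on state_abbr keeps all 1,938 police staffing rows):
ps_data %<>%
  left_join(state_abb_data, by = "state_abbr") %>%
  dplyr::select(-state_abbr)
ps_data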
# A tibble: 1,938 x 3
data_year officer_state_total STATE
<dbl> <dbl> <chr>
1 1977 544 Alaska
2 1977 7380 Alabama
3 1977 3344 Arkansas
4 1977 6414 Arizona
5 1977 65596 California
6 1977 7337 Colorado
7 1977 6051 Connecticut
8 1977 4751 District of Columbia
9 1977 1018 Delaware
10 1977 24588 Florida
# … with 1,928 more rows
Now we will rename the variables to match those of the other datasets.
ps_data %<>%
rename(YEAR = "data_year",
VALUE = "officer_state_total") %>%
mutate(VARIABLE = "officer_state_total")
ps_data
# A tibble: 1,938 x 4
YEAR VALUE STATE VARIABLE
<dbl> <dbl> <chr> <chr>
1 1977 544 Alaska officer_state_total
2 1977 7380 Alabama officer_state_total
3 1977 3344 Arkansas officer_state_total
4 1977 6414 Arizona officer_state_total
5 1977 65596 California officer_state_total
6 1977 7337 Colorado officer_state_total
7 1977 6051 Connecticut officer_state_total
8 1977 4751 District of Columbia officer_state_total
9 1977 1018 Delaware officer_state_total
10 1977 24588 Florida officer_state_total
# … with 1,928 more rows
We also need to adjust the value to be per 100,000 people in the state. To do so we need the population for each state, which luckily we already have. We will slightly modify the population data and create a new tibble that will make it more clear how we are dividing by it.
denominator_temp <- population_data %>%
select(-VARIABLE) %>%
rename("Population_temp"=VALUE)
head(denominator_temp)
# A tibble: 6 x 3
YEAR STATE Population_temp
<dbl> <chr> <dbl>
1 1977 Alabama 3782571
2 1977 Alaska 397220
3 1977 Arizona 2427296
4 1977 Arkansas 2207195
5 1977 California 22350332
6 1977 Colorado 2696179
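The output below shows ps_data after joining in this denominator; the hidden chunk was presumably:
ps_data %<>%
  left_join(denominator_temp, by = c("YEAR", "STATE"))
head(ps_data)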
# A tibble: 6 x 5
YEAR VALUE STATE VARIABLE Population_temp
<dbl> <dbl> <chr> <chr> <dbl>
1 1977 544 Alaska officer_state_total 397220
2 1977 7380 Alabama officer_state_total 3782571
3 1977 3344 Arkansas officer_state_total 2207195
4 1977 6414 Arizona officer_state_total 2427296
5 1977 65596 California officer_state_total 22350332
6 1977 7337 Colorado officer_state_total 2696179
ps_data %<>%
mutate(VALUE = (VALUE * 100000) / Population_temp) %>%
#mutate(VALUE = lag(VALUE)) %>%
mutate(VARIABLE = "police_per_100k_lag") %>%
select(-Population_temp)
ps_data
# A tibble: 1,938 x 4
YEAR VALUE STATE VARIABLE
<dbl> <dbl> <chr> <chr>
1 1977 137. Alaska police_per_100k_lag
2 1977 195. Alabama police_per_100k_lag
3 1977 152. Arkansas police_per_100k_lag
4 1977 264. Arizona police_per_100k_lag
5 1977 293. California police_per_100k_lag
6 1977 272. Colorado police_per_100k_lag
7 1977 196. Connecticut police_per_100k_lag
8 1977 697. District of Columbia police_per_100k_lag
9 1977 171. Delaware police_per_100k_lag
10 1977 277. Florida police_per_100k_lag
# … with 1,928 more rows
The first thing we need to do with the unemployment data is combine the data across the different states. We can do that using the bind_rows() function of dplyr, which appends the data together based on the presence of columns with the same name in the different tibbles. We will use the map_df() function of the purrr package to allow us to do this across each tibble in our list. We will then select just the annual data for each state and year and we will rename our variables to be consistent with some of the other data that we are working with. Thus we would like our variables to be YEAR, VALUE and VARIABLE in all caps.
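The combining chunk is hidden in the rendered page. Assuming the imported state tibbles live in a list called ue_rate_list (a hypothetical name, not confirmed by the source), it was presumably:
# ue_rate_list is an assumed name for the list of per-state tibbles
ue_rate_data <- map_df(ue_rate_list, bind_rows)
head(ue_rate_data)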
# A tibble: 6 x 15
STATE Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Alaba… 1977 7.5 9 7.7 7.2 6.8 8.6 8 7.8 6.7 6.3 6.3
2 Alaba… 1978 7.1 6.9 6.2 5.4 5.1 6.9 6.7 6.7 6.5 6.3 6.3
3 Alaba… 1979 6.7 7.5 6.9 6.6 6.4 8.4 7.7 7.8 7.1 7.2 6.9
4 Alaba… 1980 7.7 7.8 7.4 7.4 8.4 9.7 10.4 10.3 9.3 9.6 9.4
5 Alaba… 1981 10 10.3 9.5 9.1 9.4 11.1 10.4 10.9 10.8 11.7 11.5
6 Alaba… 1982 13.2 13.2 12.9 12.6 12.8 14.5 14.7 14.8 14.7 15.1 15.4
# … with 2 more variables: Dec <dbl>, Annual <dbl>
ue_rate_data <- ue_rate_data %>%
dplyr::select(STATE, Year, Annual) %>%
rename("YEAR" = Year,
"VALUE" = Annual) %>%
mutate(VARIABLE = "Unemployment_rate")
head(ue_rate_data)
# A tibble: 6 x 4
STATE YEAR VALUE VARIABLE
<chr> <dbl> <dbl> <chr>
1 Alabama 1977 7.3 Unemployment_rate
2 Alabama 1978 6.4 Unemployment_rate
3 Alabama 1979 7.2 Unemployment_rate
4 Alabama 1980 8.9 Unemployment_rate
5 Alabama 1981 10.6 Unemployment_rate
6 Alabama 1982 14.1 Unemployment_rate
Click here to see details about how the poverty data was wrangled
OK, now for wrangling the poverty data. First let’s take a look at it.
# A tibble: 6 x 6
`NOTE: Number in thousa… ...2 ...3 ...4 ...5 ...6
<chr> <chr> <chr> <chr> <chr> <chr>
1 2018 <NA> <NA> <NA> <NA> <NA>
2 STATE Total Number "Standard\n… Percent "Standard\ner…
3 Alabama 4877 779 "65" 16 "1.3"
4 Alaska 720 94 "9" 13.1 "1.2"
5 Arizona 7241 929 "80" 12.80000000… "1.1000000000…
6 Arkansas 2912 462 "38" 15.9 "1.3"
We can see that the column names are actually shifted down below the row with the year. So we will manually make these values the actual column names.
colnames(poverty_rate_data) <- c("STATE",
"Total",
"Number",
"Number_se",
"Percent",
"Percent_se")
poverty_rate_data2 <- poverty_rate_data
Let’s also remove the rows where the column names are listed, like row number 2.
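One simple way to do this, now that the real column names are set, is to drop every row whose STATE value is the literal string "STATE" (a sketch; the actual hidden chunk may differ):
poverty_rate_data %<>%
  filter(STATE != "STATE")
head(poverty_rate_data)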
# A tibble: 6 x 6
STATE Total Number Number_se Percent Percent_se
<chr> <chr> <chr> <chr> <chr> <chr>
1 2018 <NA> <NA> <NA> <NA> <NA>
2 Alabama 4877 779 65 16 1.3
3 Alaska 720 94 9 13.1 1.2
4 Arizona 7241 929 80 12.800000000000001 1.1000000000000001
5 Arkansas 2912 462 38 15.9 1.3
6 California 39150 4664 184 11.9 0.5
We can also see that there are some extra notes at the end of our data. This is why it is a good idea to look at both the head and tail of your data.
# A tibble: 6 x 6
STATE Total Number Number_se Percent Percent_se
<chr> <chr> <chr> <chr> <chr> <chr>
1 Wisconsin 4724 403 57 8.5 1.1000000000…
2 Wyoming 468 49 20 10.4 4
3 Standard errors shown in this ta… <NA> <NA> <NA> <NA> <NA>
4 For information on confidentiali… <NA> <NA> <NA> <NA> <NA>
5 Footnotes are available at <www.… <NA> <NA> <NA> <NA> <NA>
6 SOURCE: U.S. Bureau of the Censu… <NA> <NA> <NA> <NA> <NA>
We can see that the strings for the state for these rows are very long. We can also see that there are rows that just have the year, where the state is only 4 characters long. We will create a new variable called length_state based on the number of characters in the STATE values. We will use the str_length() function of the stringr package. We need to use the map_dbl() function to apply this to each row of the STATE variable. The map() function creates a list, whereas the map_dbl() function creates a vector of class double. If we were to use map() we would need to use unlist() and pull().
poverty_rate_data %<>%
mutate(length_state = map_dbl(STATE, str_length))
# Alternatively with map()
#poverty_rate_data %<>%
#mutate(length_state = unlist(map(pull(poverty_rate_data, STATE), str_length)))
poverty_rate_data
# A tibble: 2,136 x 7
STATE Total Number Number_se Percent Percent_se length_state
<chr> <chr> <chr> <chr> <chr> <chr> <dbl>
1 2018 <NA> <NA> <NA> <NA> <NA> 4
2 Alabama 4877 779 65 16 1.3 7
3 Alaska 720 94 9 13.1 1.2 6
4 Arizona 7241 929 80 12.80000000… 1.10000000000… 7
5 Arkansas 2912 462 38 15.9 1.3 8
6 California 39150 4664 184 11.9 0.5 10
7 Colorado 5739 521 51 9.099999999… 0.90000000000… 8
8 Connecticut 3413 348 43 10.19999999… 1.3 11
9 Delaware 976 72 9 7.400000000… 1 8
10 District of … 692 102 8 14.69999999… 1.10000000000… 20
# … with 2,126 more rows
Great, now let’s look at the tail with our new variable length_state
[1] 9 7 285 164 129 101
# A tibble: 6 x 7
STATE Total Number Number_se Percent Percent_se length_state
<chr> <chr> <chr> <chr> <chr> <chr> <dbl>
1 Vermont 520 62 22 12 4 7
2 Virginia 5204 647 72 12.4 1.3 8
3 Washington 4223 538 65 12.6999999999… 1.399999999999… 10
4 West Virgi… 1952 297 49 15.1999999999… 2.299999999999… 13
5 Wisconsin 4724 403 57 8.5 1.100000000000… 9
6 Wyoming 468 49 20 10.4 4 7
Looks good!
Now let’s select all the STATE values that are actually year values so that we can create a new variable about the year. We can do so by using the str_detect() function of the stringr package to look for digits, i.e. values of 0-9, indicated by "[:digit:]".
As you can see in the RStudio cheatsheet about regular expressions this notation indicates any digit between 0 and 9.
# Scroll through the output!
poverty_rate_data %>%
filter(str_detect(STATE, "[:digit:]")) %>%
print(n = 51)
# A tibble: 41 x 7
STATE Total Number Number_se Percent Percent_se length_state
<chr> <chr> <chr> <chr> <chr> <chr> <dbl>
1 2018 <NA> <NA> <NA> <NA> <NA> 4
2 2017 (21) <NA> <NA> <NA> <NA> <NA> 9
3 2017 <NA> <NA> <NA> <NA> <NA> 4
4 2016 <NA> <NA> <NA> <NA> <NA> 4
5 2015 <NA> <NA> <NA> <NA> <NA> 4
6 2014 <NA> <NA> <NA> <NA> <NA> 4
7 2013 (19) <NA> <NA> <NA> <NA> <NA> 9
8 2013 (18) <NA> <NA> <NA> <NA> <NA> 9
9 2012 <NA> <NA> <NA> <NA> <NA> 4
10 2011 <NA> <NA> <NA> <NA> <NA> 4
11 2010 (17) <NA> <NA> <NA> <NA> <NA> 9
12 2009 <NA> <NA> <NA> <NA> <NA> 4
13 2008 <NA> <NA> <NA> <NA> <NA> 4
14 2007 <NA> <NA> <NA> <NA> <NA> 4
15 2006 <NA> <NA> <NA> <NA> <NA> 4
16 2005 <NA> <NA> <NA> <NA> <NA> 4
17 2004 (14) <NA> <NA> <NA> <NA> <NA> 9
18 2003 <NA> <NA> <NA> <NA> <NA> 4
19 2002 <NA> <NA> <NA> <NA> <NA> 4
20 2001 <NA> <NA> <NA> <NA> <NA> 4
21 2000 (12) <NA> <NA> <NA> <NA> <NA> 9
22 1999 (11) <NA> <NA> <NA> <NA> <NA> 9
23 1998 <NA> <NA> <NA> <NA> <NA> 4
24 1997 <NA> <NA> <NA> <NA> <NA> 4
25 1996 <NA> <NA> <NA> <NA> <NA> 4
26 1995 <NA> <NA> <NA> <NA> <NA> 4
27 1994 <NA> <NA> <NA> <NA> <NA> 4
28 1993 (10) <NA> <NA> <NA> <NA> <NA> 9
29 1992 (9) <NA> <NA> <NA> <NA> <NA> 8
30 1991 (8) <NA> <NA> <NA> <NA> <NA> 8
31 1990 <NA> <NA> <NA> <NA> <NA> 4
32 1989 <NA> <NA> <NA> <NA> <NA> 4
33 1988 <NA> <NA> <NA> <NA> <NA> 4
34 1987 (7) <NA> <NA> <NA> <NA> <NA> 8
35 1986 <NA> <NA> <NA> <NA> <NA> 4
36 1985 <NA> <NA> <NA> <NA> <NA> 4
37 1984 <NA> <NA> <NA> <NA> <NA> 4
38 1983 (6) <NA> <NA> <NA> <NA> <NA> 8
39 1982 <NA> <NA> <NA> <NA> <NA> 4
40 1981 (5) <NA> <NA> <NA> <NA> <NA> 8
41 1980 <NA> <NA> <NA> <NA> <NA> 4
Some of the years (2013 and 2017) are listed twice with a number in parentheses; others are listed just once with a number in parentheses. Looking at the technical documentation, this seems to be due to updates to the definition of poverty and to the methods used to estimate poverty levels. See here and here for more information. We will simply select one of the sets of data for 2013 and 2017.
# A tibble: 0 x 7
# … with 7 variables: STATE <chr>, Total <chr>, Number <chr>, Number_se <chr>,
# Percent <chr>, Percent_se <chr>, length_state <dbl>
First let’s add the year value to our data.
There should consistently be data for 51 states (including DC). We can see that sometimes DC is spelled out and sometimes it is not.
### Scroll through the output!
poverty_rate_data %>%
filter(str_detect(STATE, "[:alpha:]")) %>%
distinct(STATE) %>% print(n = 100)
# A tibble: 52 x 1
STATE
<chr>
1 Alabama
2 Alaska
3 Arizona
4 Arkansas
5 California
6 Colorado
7 Connecticut
8 Delaware
9 District of Columbia
10 Florida
11 Georgia
12 Hawaii
13 Idaho
14 Illinois
15 Indiana
16 Iowa
17 Kansas
18 Kentucky
19 Louisiana
20 Maine
21 Maryland
22 Massachusetts
23 Michigan
24 Minnesota
25 Mississippi
26 Missouri
27 Montana
28 Nebraska
29 Nevada
30 New Hampshire
31 New Jersey
32 New Mexico
33 New York
34 North Carolina
35 North Dakota
36 Ohio
37 Oklahoma
38 Oregon
39 Pennsylvania
40 Rhode Island
41 South Carolina
42 South Dakota
43 Tennessee
44 Texas
45 Utah
46 Vermont
47 Virginia
48 Washington
49 West Virginia
50 Wisconsin
51 Wyoming
52 D.C.
Now we will replace "D.C." with "District of Columbia" using str_replace(). We can use the tally() function of the dplyr package to check that we now have only 51 distinct values.
poverty_rate_data %<>%
mutate(STATE = str_replace(STATE, pattern = "D.C.",
replacement = "District of Columbia" ))
poverty_rate_data %>%
filter(str_detect(STATE, "[:alpha:]")) %>%
distinct(STATE) %>% tally()
# A tibble: 1 x 1
n
<int>
1 51
Great! Now, is each of the states occurring as often as the unique year values? We can first check how many year values there are. Then we can use the count() function of the dplyr package to check how often the states are repeated.
# A tibble: 1 x 1
n
<int>
1 41
There are 41 different sets of data according to year values.
### Scroll through the output!
poverty_rate_data %>%
filter(str_detect(STATE, "[:alpha:]")) %>%
count(STATE) %>%
print(n = 51)
# A tibble: 51 x 2
STATE n
<chr> <int>
1 Alabama 41
2 Alaska 41
3 Arizona 41
4 Arkansas 41
5 California 41
6 Colorado 41
7 Connecticut 41
8 Delaware 41
9 District of Columbia 41
10 Florida 41
11 Georgia 41
12 Hawaii 41
13 Idaho 41
14 Illinois 41
15 Indiana 41
16 Iowa 41
17 Kansas 41
18 Kentucky 41
19 Louisiana 41
20 Maine 41
21 Maryland 41
22 Massachusetts 41
23 Michigan 41
24 Minnesota 41
25 Mississippi 41
26 Missouri 41
27 Montana 41
28 Nebraska 41
29 Nevada 41
30 New Hampshire 41
31 New Jersey 41
32 New Mexico 41
33 New York 41
34 North Carolina 41
35 North Dakota 41
36 Ohio 41
37 Oklahoma 41
38 Oregon 41
39 Pennsylvania 41
40 Rhode Island 41
41 South Carolina 41
42 South Dakota 41
43 Tennessee 41
44 Texas 41
45 Utah 41
46 Vermont 41
47 Virginia 41
48 Washington 41
49 West Virginia 41
50 Wisconsin 41
51 Wyoming 41
Indeed, looks like each of the states are repeated the same number of times!
Now let’s create a new variable YEAR that repeats each year value once for each of the 51 states plus once for the row that has just the year value, for a total of 52 repeats per year.
year_values <- poverty_rate_data %>%
filter(str_detect(STATE, "[:digit:]")) %>%
distinct(STATE)
year_values <- rep(pull(year_values, STATE), each = 52)
setequal(length(year_values), length(poverty_rate_data$STATE))
[1] TRUE
Now we will add this to our poverty_rate_data. We will also remove the length_state variable using the select() function of the dplyr package and a minus sign before the variable name.
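A sketch of this step (the hidden chunk presumably looked like this):
poverty_rate_data %<>%
  mutate(year_value = year_values) %>%
  dplyr::select(-length_state)
poverty_rate_data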
# A tibble: 2,132 x 7
STATE Total Number Number_se Percent Percent_se year_value
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 2018 <NA> <NA> <NA> <NA> <NA> 2018
2 Alabama 4877 779 65 16 1.3 2018
3 Alaska 720 94 9 13.1 1.2 2018
4 Arizona 7241 929 80 12.800000000… 1.10000000000… 2018
5 Arkansas 2912 462 38 15.9 1.3 2018
6 California 39150 4664 184 11.9 0.5 2018
7 Colorado 5739 521 51 9.0999999999… 0.90000000000… 2018
8 Connecticut 3413 348 43 10.199999999… 1.3 2018
9 Delaware 976 72 9 7.4000000000… 1 2018
10 District of … 692 102 8 14.699999999… 1.10000000000… 2018
11 Florida 21117 2883 173 13.699999999… 0.80000000000… 2018
12 Georgia 10423 1548 115 14.800000000… 1.10000000000… 2018
13 Hawaii 1393 128 16 9.1999999999… 1.10000000000… 2018
14 Idaho 1766 202 25 11.5 1.39999999999… 2018
15 Illinois 12589 1292 130 10.300000000… 1 2018
16 Indiana 6582 761 68 11.6 1 2018
17 Iowa 3110 277 30 8.9000000000… 1 2018
18 Kansas 2835 212 27 7.5 0.90000000000… 2018
19 Kentucky 4445 696 76 15.699999999… 1.7 2018
20 Louisiana 4510 858 38 19 0.80000000000… 2018
21 Maine 1321 153 19 11.6 1.39999999999… 2018
22 Maryland 6030 480 50 8 0.80000000000… 2018
23 Massachusetts 6883 601 59 8.6999999999… 0.90000000000… 2018
24 Michigan 9913 1036 78 10.5 0.80000000000… 2018
25 Minnesota 5746 456 47 7.9000000000… 0.80000000000… 2018
26 Mississippi 2899 567 27 19.600000000… 0.90000000000… 2018
27 Missouri 6026 745 81 12.4 1.3 2018
28 Montana 1041 107 12 10.300000000… 1.10000000000… 2018
29 Nebraska 1893 199 25 10.5 1.3 2018
30 Nevada 3008 390 37 13 1.2 2018
31 New Hampshire 1349 82 11 6.0999999999… 0.80000000000… 2018
32 New Jersey 8790 725 65 8.1999999999… 0.69999999999… 2018
33 New Mexico 2054 342 27 16.600000000… 1.3 2018
34 New York 19343 2145 110 11.1 0.59999999999… 2018
35 North Caroli… 10369 1355 96 13.1 0.90000000000… 2018
36 North Dakota 745 72 9 9.6999999999… 1.2 2018
37 Ohio 11452 1365 115 11.9 1 2018
38 Oklahoma 3858 518 48 13.4 1.2 2018
39 Oregon 4172 404 45 9.6999999999… 1.10000000000… 2018
40 Pennsylvania 12519 1476 124 11.800000000… 1 2018
41 Rhode Island 1036 92 15 8.9000000000… 1.39999999999… 2018
42 South Caroli… 5036 642 45 12.800000000… 0.90000000000… 2018
43 South Dakota 853 90 8 10.6 0.90000000000… 2018
44 Tennessee 6665 800 73 12 1.10000000000… 2018
45 Texas 28497 3894 173 13.699999999… 0.59999999999… 2018
46 Utah 3173 219 37 6.9000000000… 1.2 2018
47 Vermont 616 60 7 9.6999999999… 1.10000000000… 2018
48 Virginia 8393 821 86 9.8000000000… 1 2018
49 Washington 7555 647 101 8.5999999999… 1.3 2018
50 West Virginia 1762 279 21 15.9 1.2 2018
51 Wisconsin 5795 499 54 8.5999999999… 0.90000000000… 2018
52 Wyoming 565 53 6 9.5 1.10000000000… 2018
53 2017 (21) <NA> <NA> <NA> <NA> <NA> 2017 (21)
54 Alabama 4801 735 58 15.300000000… 1.2 2017 (21)
55 Alaska 719 87 12 12.1 1.7 2017 (21)
56 Arizona 6981 951 104 13.6 1.5 2017 (21)
57 Arkansas 2921 436 33 14.9 1.10000000000… 2017 (21)
58 California 39247 4759 182 12.1 0.5 2017 (21)
59 Colorado 5527 489 51 8.9000000000… 0.90000000000… 2017 (21)
60 Connecticut 3553 377 43 10.6 1.2 2017 (21)
61 Delaware 967 85 10 8.8000000000… 1.10000000000… 2017 (21)
62 District of … 691 96 8 13.9 1.10000000000… 2017 (21)
63 Florida 20909 2809 173 13.4 0.80000000000… 2017 (21)
64 Georgia 10231 1339 104 13.1 1 2017 (21)
65 Hawaii 1402 149 17 10.6 1.2 2017 (21)
66 Idaho 1730 199 23 11.5 1.3 2017 (21)
67 Illinois 12597 1454 102 11.5 0.80000000000… 2017 (21)
68 Indiana 6532 762 70 11.699999999… 1.10000000000… 2017 (21)
69 Iowa 3053 230 33 7.5 1.10000000000… 2017 (21)
70 Kansas 2868 411 33 14.300000000… 1.2 2017 (21)
71 Kentucky 4395 591 76 13.5 1.7 2017 (21)
72 Louisiana 4535 931 48 20.5 1.10000000000… 2017 (21)
73 Maine 1315 163 19 12.4 1.39999999999… 2017 (21)
74 Maryland 5977 451 58 7.5999999999… 1 2017 (21)
75 Massachusetts 6784 763 71 11.199999999… 1 2017 (21)
76 Michigan 9895 1135 88 11.5 0.90000000000… 2017 (21)
77 Minnesota 5619 479 57 8.5 1 2017 (21)
78 Mississippi 2948 544 27 18.5 0.90000000000… 2017 (21)
79 Missouri 5988 683 81 11.4 1.39999999999… 2017 (21)
80 Montana 1041 107 11 10.300000000… 1 2017 (21)
81 Nebraska 1878 216 25 11.5 1.3 2017 (21)
82 Nevada 2979 392 37 13.199999999… 1.2 2017 (21)
83 New Hampshire 1333 95 13 7.2000000000… 1 2017 (21)
84 New Jersey 9015 894 95 9.9000000000… 1.10000000000… 2017 (21)
85 New Mexico 2035 402 28 19.699999999… 1.39999999999… 2017 (21)
86 New York 19735 2510 141 12.699999999… 0.69999999999… 2017 (21)
87 North Caroli… 10297 1567 96 15.199999999… 0.90000000000… 2017 (21)
88 North Dakota 740 92 11 12.4 1.5 2017 (21)
89 Ohio 11491 1479 98 12.9 0.90000000000… 2017 (21)
90 Oklahoma 3817 490 47 12.800000000… 1.2 2017 (21)
91 Oregon 4202 482 55 11.5 1.3 2017 (21)
92 Pennsylvania 12568 1373 94 10.9 0.80000000000… 2017 (21)
93 Rhode Island 1046 118 15 11.300000000… 1.39999999999… 2017 (21)
94 South Caroli… 4955 756 55 15.199999999… 1.10000000000… 2017 (21)
95 South Dakota 870 93 15 10.699999999… 1.7 2017 (21)
96 Tennessee 6699 759 65 11.300000000… 1 2017 (21)
97 Texas 28090 3715 186 13.199999999… 0.69999999999… 2017 (21)
98 Utah 3130 272 34 8.6999999999… 1.10000000000… 2017 (21)
99 Vermont 613 53 6 8.5999999999… 1 2017 (21)
100 Virginia 8242 862 71 10.5 0.90000000000… 2017 (21)
# … with 2,032 more rows
Looks good! Now we will remove the rows that have just the year values by only preserving those with alpha characters.
Now let’s remove the older data for 2013 and 2017 which is the data that appears lower in the tibble.
We also want to just keep the first 4 digits of the year_value and create a YEAR variable. We need to pull the year_value data because str_sub() expects a character vector not a tibble.
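A sketch covering these three steps (the actual hidden chunk may differ; note that inside mutate() the year_value column is already a character vector, and distinct() with .keep_all = TRUE keeps the first occurrence of each STATE/YEAR pair, which drops the older 2013 and 2017 rows appearing lower in the tibble):
poverty_rate_data %<>%
  filter(str_detect(STATE, "[:alpha:]")) %>%             # drop the year-only rows
  mutate(YEAR = str_sub(year_value, start = 1, end = 4)) %>%
  distinct(STATE, YEAR, .keep_all = TRUE)                # keep first of any duplicates
poverty_rate_data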
# A tibble: 1,989 x 8
STATE Total Number Number_se Percent Percent_se year_value YEAR
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Alabama 4877 779 65 16 1.3 2018 2018
2 Alaska 720 94 9 13.1 1.2 2018 2018
3 Arizona 7241 929 80 12.8000000… 1.1000000000… 2018 2018
4 Arkansas 2912 462 38 15.9 1.3 2018 2018
5 California 39150 4664 184 11.9 0.5 2018 2018
6 Colorado 5739 521 51 9.09999999… 0.9000000000… 2018 2018
7 Connecticut 3413 348 43 10.1999999… 1.3 2018 2018
8 Delaware 976 72 9 7.40000000… 1 2018 2018
9 District o… 692 102 8 14.6999999… 1.1000000000… 2018 2018
10 Florida 21117 2883 173 13.6999999… 0.8000000000… 2018 2018
# … with 1,979 more rows
Looks good! Now we will just remove the extra variables and rename the variables we want to keep to be similar to our other data.
poverty_rate_data %<>%
dplyr::select(- Number,
- Number_se,
- Percent_se,
- Total,
- year_value) %>%
rename("VALUE" = Percent) %>%
mutate(VARIABLE = "Poverty_rate",
YEAR = as.numeric(YEAR),
VALUE = as.numeric(VALUE))
head(poverty_rate_data)
# A tibble: 6 x 4
STATE VALUE YEAR VARIABLE
<chr> <dbl> <dbl> <chr>
1 Alabama 16 2018 Poverty_rate
2 Alaska 13.1 2018 Poverty_rate
3 Arizona 12.8 2018 Poverty_rate
4 Arkansas 15.9 2018 Poverty_rate
5 California 11.9 2018 Poverty_rate
6 Colorado 9.1 2018 Poverty_rate
Looks great!
Click here to see details about how the violent crime data was wrangled
The crime_data was imported using read_lines() and we have some lines that we don’t necessarily need. A large amount of the original data is notes at the end of the table. We want to remove these lines. We can determine where they start by searching for the row that contains the first statement of these notes using the str_which() function of the stringr package. We will subtract one from this as there is a blank line in between.
[1] ",,\"Vermont - The state UCR Program was unable to provide complete 1997 offense figures in accordance with UCR guidelines. The 1996 and 1997 percent changes within the geographic division in which Vermont is categorized were applied to the valid 1996 state total to effect the 1997 state total.\""
[2] " "
[3] " "
[4] ",,\"Wisconsin - The state UCR Program was unable to provide complete 1998 offense figures in accordance with UCR guidelines. The state total was estimated by using 1997 figures for the nonreporting areas and applying 1997 versus 1998 percentage changes for the division in which Wisconsin is located. The estimates for the nonreporting areas were then increased by any actual 1998 crime counts received.\""
[5] " "
[6] "\"Sources: FBI, Uniform Crime Reports, prepared by the National Archive of Criminal Justice Data\" "
crime_data <- crime_data[-((str_which(crime_data, "The figures shown in this column for the offense of rape were estimated using the legacy UCR definition of rape")-1): length(crime_data))]
#crime_data <- crime_data[-(2143:length(crime_data))]
tail(crime_data)
[1] "2009, 544270, 1196, 11, 172,, 78, 935 "
[2] "2010, 564554, 1117, 8, 162,, 77, 870 "
[3] "2011, 567356, 1245, 18, 146,, 71, 1010 "
[4] "2012, 576626, 1161, 14, 154,, 61, 932 "
[5] "2013, 583223, 1212, 17, 144, 204, 74, 917 "
[6] "2014, 584153, 1142, 16, 126, 174, 53, 899 "
There are lines for each year from 1977 to 2014 as well as four lines about each state and the header information for each state. Here you can see what the original data looks like:
We want to delete the header information and only retain the lines with numeric values or state names. Since there are 38 years' worth of data for each state and 4 header lines per state (including the state name line), each state has 42 lines. We want to delete lines 2 through 4 for each state. We will save the header information once to use later.
[1] "Estimated crime in Alabama"
[2] "\n,,National or state crime,,,,,,,"
[3] "\n,,Violent crime,,,,,,,"
[4] "\nYear,Population,Violent crime total,Murder and nonnegligent Manslaughter,Legacy rape /1,Revised rape /2,Robbery,Aggravated assault,"
[5] "1977, 3690000, 15293, 524, 929,, 3572, 10268 "
[6] "1978, 3742000, 15682, 499, 954,, 3708, 10521 "
So starting at lines 2, 3, and 4, we create sequences of numbers that increase by the length of an individual state's block of lines. We can do so using the base seq() function. We can take a look at these in order using the base sort() function.
delete_rows <- c(seq(from = 2,
to = length(crime_data),
by = rep_cycle),
seq(from = 3,
to = length(crime_data),
by = rep_cycle),
seq(from = 4,
to = length(crime_data),
by = rep_cycle))
sort(delete_rows)
[1] 2 3 4 44 45 46 86 87 88 128 129 130 170 171 172
[16] 212 213 214 254 255 256 296 297 298 338 339 340 380 381 382
[31] 422 423 424 464 465 466 506 507 508 548 549 550 590 591 592
[46] 632 633 634 674 675 676 716 717 718 758 759 760 800 801 802
[61] 842 843 844 884 885 886 926 927 928 968 969 970 1010 1011 1012
[76] 1052 1053 1054 1094 1095 1096 1136 1137 1138 1178 1179 1180 1220 1221 1222
[91] 1262 1263 1264 1304 1305 1306 1346 1347 1348 1388 1389 1390 1430 1431 1432
[106] 1472 1473 1474 1514 1515 1516 1556 1557 1558 1598 1599 1600 1640 1641 1642
[121] 1682 1683 1684 1724 1725 1726 1766 1767 1768 1808 1809 1810 1850 1851 1852
[136] 1892 1893 1894 1934 1935 1936 1976 1977 1978 2018 2019 2020 2060 2061 2062
[151] 2102 2103 2104
Thus we will delete lines 2, 3, and 4, then skip the next 39 lines (the 38 years of data for the first state plus the state information line for the next state), delete the next 3 consecutive lines, and so on. We can indeed see that lines 44-46 are what we wish to remove.
[1] "\n,,National or state crime,,,,,,,"
[2] "\n,,Violent crime,,,,,,,"
[3] "\nYear,Population,Violent crime total,Murder and nonnegligent Manslaughter,Legacy rape /1,Revised rape /2,Robbery,Aggravated assault,"
Nice!
Now we can select all the lines that have state information. We can repeat each of these once for each of the 38 years of data per state by using the base rep() function with the each = argument. Finally we will remove the "Estimated crime in " portion of the string using the str_remove() function of the stringr package. We will later combine this with the crime data.
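The selection chunk is hidden; it presumably used str_subset() of the stringr package (or something equivalent) to keep only the state label lines:
state_label_order <- str_subset(crime_data, pattern = "Estimated crime in ")
state_label_order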
[1] "Estimated crime in Alabama"
[2] "Estimated crime in Alaska"
[3] "Estimated crime in Arizona"
[4] "Estimated crime in Arkansas"
[5] "Estimated crime in California"
[6] "Estimated crime in Colorado"
[7] "Estimated crime in Connecticut"
[8] "Estimated crime in Delaware"
[9] "Estimated crime in District of Columbia"
[10] "Estimated crime in Florida"
[11] "Estimated crime in Georgia"
[12] "Estimated crime in Hawaii"
[13] "Estimated crime in Idaho"
[14] "Estimated crime in Illinois"
[15] "Estimated crime in Indiana"
[16] "Estimated crime in Iowa"
[17] "Estimated crime in Kansas"
[18] "Estimated crime in Kentucky"
[19] "Estimated crime in Louisiana"
[20] "Estimated crime in Maine"
[21] "Estimated crime in Maryland"
[22] "Estimated crime in Massachusetts"
[23] "Estimated crime in Michigan"
[24] "Estimated crime in Minnesota"
[25] "Estimated crime in Mississippi"
[26] "Estimated crime in Missouri"
[27] "Estimated crime in Montana"
[28] "Estimated crime in Nebraska"
[29] "Estimated crime in Nevada"
[30] "Estimated crime in New Hampshire"
[31] "Estimated crime in New Jersey"
[32] "Estimated crime in New Mexico"
[33] "Estimated crime in New York"
[34] "Estimated crime in North Carolina"
[35] "Estimated crime in North Dakota"
[36] "Estimated crime in Ohio"
[37] "Estimated crime in Oklahoma"
[38] "Estimated crime in Oregon"
[39] "Estimated crime in Pennsylvania"
[40] "Estimated crime in Rhode Island"
[41] "Estimated crime in South Carolina"
[42] "Estimated crime in South Dakota"
[43] "Estimated crime in Tennessee"
[44] "Estimated crime in Texas"
[45] "Estimated crime in Utah"
[46] "Estimated crime in Vermont"
[47] "Estimated crime in Virginia"
[48] "Estimated crime in Washington"
[49] "Estimated crime in West Virginia"
[50] "Estimated crime in Wisconsin"
[51] "Estimated crime in Wyoming"
state_label_order <- rep(state_label_order, each = 38)
crime_states <-str_remove(state_label_order, pattern = "Estimated crime in ")
head(crime_states)
[1] "Alabama" "Alabama" "Alabama" "Alabama" "Alabama" "Alabama"
Nice! Now for the rest of the data. We now need to remove the lines with the state information.
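This was presumably done with the negate argument of str_subset(), or something equivalent (a sketch):
crime_data <- str_subset(crime_data, pattern = "Estimated crime in ", negate = TRUE)
head(crime_data)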
[1] "1977, 3690000, 15293, 524, 929,, 3572, 10268 "
[2] "1978, 3742000, 15682, 499, 954,, 3708, 10521 "
[3] "1979, 3769000, 15578, 496, 1037,, 4127, 9918 "
[4] "1980, 3861466, 17320, 509, 1158,, 5102, 10551 "
[5] "1981, 3916000, 18423, 465, 1021,, 4952, 11985 "
[6] "1982, 3943000, 17653, 417, 1026,, 4417, 11793 "
[1] "2009, 544270, 1196, 11, 172,, 78, 935 "
[2] "2010, 564554, 1117, 8, 162,, 77, 870 "
[3] "2011, 567356, 1245, 18, 146,, 71, 1010 "
[4] "2012, 576626, 1161, 14, 154,, 61, 932 "
[5] "2013, 583223, 1212, 17, 144, 204, 74, 917 "
[6] "2014, 584153, 1142, 16, 126, 174, 53, 899 "
It appears that the data is comma separated with 8 columns. One of the middle columns often has no value; we need to fill these in with NAs. We can use the read_csv() function from the readr package to do this. It turns out you don't have to have a file: you can also read from a character string or vector.
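As a small illustration (using two of the lines above), read_csv() treats a character vector of length greater than one as literal data, one element per row, and the empty field between the double commas becomes NA:
library(readr)
read_csv(c("1977, 3690000, 15293, 524, 929,, 3572, 10268",
           "1978, 3742000, 15682, 499, 954,, 3708, 10521"),
         col_names = FALSE)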
[1] "1977, 3690000, 15293, 524, 929,, 3572, 10268 "
[2] "1978, 3742000, 15682, 499, 954,, 3708, 10521 "
[3] "1979, 3769000, 15578, 496, 1037,, 4127, 9918 "
[4] "1980, 3861466, 17320, 509, 1158,, 5102, 10551 "
[5] "1981, 3916000, 18423, 465, 1021,, 4952, 11985 "
[6] "1982, 3943000, 17653, 417, 1026,, 4417, 11793 "
Nice! Now we just need our colnames. Recall that we saved this information.
[1] "\nYear,Population,Violent crime total,Murder and nonnegligent Manslaughter,Legacy rape /1,Revised rape /2,Robbery,Aggravated assault,"
colnames(crime_data_sep) <-c("Year",
"Population",
"Violent_crime_total",
"Murder_and_nonnegligent_Manslaughter",
"Legacy_rape" ,
"Revised_rape",
"Robbery",
"Aggravated_assault")
head(crime_data_sep)
# A tibble: 6 x 8
Year Population Violent_crime_t… Murder_and_nonn… Legacy_rape Revised_rape
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1977 3690000 15293 524 929 NA
2 1978 3742000 15682 499 954 NA
3 1979 3769000 15578 496 1037 NA
4 1980 3861466 17320 509 1158 NA
5 1981 3916000 18423 465 1021 NA
6 1982 3943000 17653 417 1026 NA
# … with 2 more variables: Robbery <dbl>, Aggravated_assault <dbl>
We also want to combine this with the state information we collected earlier. We will use the bind_cols() function of the dplyr package to do this. This requires that the data have the same number of rows.
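A minimal sketch of that step, assuming crime_states has exactly one entry per row of crime_data_sep:
library(dplyr)
library(tibble)
# Attach the repeated state labels as a STATE column alongside the crime data.
crime_data_sep <- bind_cols(crime_data_sep,
                            tibble(STATE = crime_states))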
Now we will create a column called VARIABLE with the value "Viol_crime_count", rename the Violent_crime_total column to VALUE and Year to YEAR, and keep only the YEAR, STATE, VARIABLE, and VALUE columns so that this dataset matches the shape of the other datasets.
crime_data <- crime_data_sep %>%
mutate(VARIABLE = "Viol_crime_count") %>%
rename("VALUE" = Violent_crime_total) %>%
rename("YEAR" = Year) %>%
select(YEAR,STATE, VARIABLE, VALUE)
crime_data
# A tibble: 1,938 x 4
YEAR STATE VARIABLE VALUE
<dbl> <chr> <chr> <dbl>
1 1977 Alabama Viol_crime_count 15293
2 1978 Alabama Viol_crime_count 15682
3 1979 Alabama Viol_crime_count 15578
4 1980 Alabama Viol_crime_count 17320
5 1981 Alabama Viol_crime_count 18423
6 1982 Alabama Viol_crime_count 17653
7 1983 Alabama Viol_crime_count 16471
8 1984 Alabama Viol_crime_count 17204
9 1985 Alabama Viol_crime_count 18398
10 1986 Alabama Viol_crime_count 22616
# … with 1,928 more rows
Here is an alternative approach to the same wrangling (from Michael):
# rep_cycle_cut is assumed to be defined earlier as the number of lines per state block
crime_data <- data.frame(cbind(crime_data,
rep(1:(length(crime_data)/rep_cycle_cut), each = rep_cycle_cut)))
colnames(crime_data) <- c("String", "STATE_GROUP")
crime_data <- crime_data %>%
group_by(STATE_GROUP) %>%
group_split()
columns_crime_data <- 8
crime_data <- crime_data %>%
map(~mutate(.,
State = case_when(str_detect(String, "Estimated crime in ") ~ substring(String, nchar("Estimated crime in ")+1)),
row_id = row_number())) %>%
map(~fill(., State)) %>%
map(~filter(.,row_id > 2)) %>%
map(~mutate(.,
String = paste0(String, ",", State))) %>%
map(~dplyr::select(.,String)) %>%
map(~str_split_fixed(.$String,",",columns_crime_data + 1)) %>%
map(~data.frame(.)) %>%
map(~rename(.,"YEAR"=X1,
"Extra_col1"=X2,
"VC"=X3,
"Extra_col2"=X4,
"Extra_col3"=X5,
"Extra_col4"=X6,
"Extra_col5"=X7,
"Extra_col6"=X8,
"STATE"=X9)) %>%
map(~dplyr::select(.,-contains("Extra_col"))) %>%
map(~.x %>% mutate_all(~trimws(.,which = "both"))) %>%
map_df(bind_rows)
sapply(crime_data, class)
crime_data <- crime_data %>%
mutate(VARIABLE = "Viol_crime_count") %>%
rename("VALUE" = VC) %>%
as_tibble() %>%
mutate(YEAR = as.numeric(YEAR),
VALUE = as.numeric(VALUE))
Click here to see details about how the RTC Law data was wrangled
The information about the laws for each state is located on page 62 of the Donohue, et al. article, so first we will select just this page. We can print part of the character string for this page using the utils str() function and the nchar.max argument.
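A sketch of how that page can be pulled out with the pdftools package (the file name here is hypothetical):
library(pdftools)
# pdf_text() returns one character string per page of the PDF,
# so page 62 is simply element 62 of the resulting vector.
DAWpaper <- pdf_text("donohue_et_al_2019.pdf") # hypothetical file name
DAWpaper_p_62 <- DAWpaper[62]
str(DAWpaper_p_62, nchar.max = 500)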
chr " Table A1: RTC Adoption Dates\n State Effective Date of RTC Law Fraction of Year In Effect Year of Passage RTC Date (Synthetic Controls Analysis)\n Alabama 1975 1975\n Alaska 10/1/1994 0.252 1995\n Arizona 7/17/1994 0.460 1995\n Arkansas 7/27/1995 0.433 1996\n California N/A 0\n Colorado 5/17/2003 0.627 2003\n Connecticut 1970 1970\n "| __truncated__
We can also use the cat() function to print the data nicely so that we can see what we are aiming for.
Table A1: RTC Adoption Dates
State Effective Date of RTC Law Fraction of Year In Effect Year of Passage RTC Date (Synthetic Controls Analysis)
Alabama 1975 1975
Alaska 10/1/1994 0.252 1995
Arizona 7/17/1994 0.460 1995
Arkansas 7/27/1995 0.433 1996
California N/A 0
Colorado 5/17/2003 0.627 2003
Connecticut 1970 1970
Delaware N/A 0
District of Columbia N/A 0
Florida 10/1/1987 0.252 1988
Georgia 8/25/1989 0.353 1990
Hawaii N/A 0
Idaho 7/1/1990 0.504 1990
Illinois 1/5/2014 2014
Indiana 1/15/1980 0.962 1980
Iowa 1/1/2011 1.000 2011
Kansas 1/1/2007 1.000 2007
Kentucky 10/1/1996 0.251 1997
Louisiana 4/19/1996 0.702 1996
Maine 9/19/1985 0.285 1986
Maryland N/A 0
Massachusetts N/A 0
Michigan 7/1/2001 0.504 2001
Minnesota 5/28/2003 0.597 2003
Mississippi 7/1/1990 0.504 1990
Missouri 2/26/2004 0.847 2004
Montana 10/1/1991 0.252 1992
Nebraska 1/1/2007 1.000 2007
Nevada 10/1/1995 0.252 1996
New Hampshire 1959 1959
New Jersey N/A 0
New Mexico 1/1/2004 1.000 2004
New York N/A 0
North Carolina 12/1/1995 0.085 1996
North Dakota 8/1/1985 0.419 1986
Ohio 4/8/2004 0.732 2004
Oklahoma 1/1/1996 1.000 1996
Oregon 1/1/1990 1.000 1990
Pennsylvania 6/17/1989 0.542 1989
Philadelphia 10/11/1995 0.225 1996
Rhode Island N/A 0
South Carolina 8/23/1996 0.358 1997
South Dakota 7/1/1985 0.504 1985
Tennessee 10/1/1996 0.251 1997
Texas 1/1/1996 1.000 1996
Utah 5/1/1995 0.671 1995
Vermont 1970 1970
Virginia 5/5/1995 0.660 1995
Washington 1961 1961
West Virginia 7/7/1989 0.488 1990
Wisconsin 11/1/2011 0.167 2012
Wyoming 10/1/1994 0.252 1995
60
We can see that this is one continuous character string. We can separate it into lines based on the presence of "\n" in the string using the str_split() function of the stringr package. We then need to unlist() the result, as the output of str_split() is a list. Finally, we can convert it to a tibble using the as_tibble() function of the tibble package. We also see that we don't need the first line about the table. We can remove this with the slice() function of the dplyr package. We can also use slice() to remove the column-name line so that we can replace it. Thus we will use slice(-(1:2)) to remove the first two lines.
So we will split and unlist() the data.
p_62 <- DAWpaper_p_62 %>%
str_split("\n") %>%
unlist() %>%
dplyr::as_tibble() %>%
slice(-(1:2))
head(p_62)
# A tibble: 6 x 1
value
<chr>
1 " Alabama 1975 …
2 " Alaska 10/1/1994 0.252 …
3 " Arizona 7/17/1994 0.460 …
4 " Arkansas 7/27/1995 0.433 …
5 " California N/A …
6 " Colorado 5/17/2003 0.627 …
# A tibble: 6 x 1
value
<chr>
1 " Washington 1961 …
2 " West Virginia 7/7/1989 0.488 …
3 " Wisconsin 11/1/2011 0.167 …
4 " Wyoming 10/1/1994 0.252 …
5 " 60"
6 ""
We also see by looking at the tail that we want to remove the last two lines. One is empty and the other has only 63 characters, which is the line with the page number.
# A tibble: 6 x 1
RTC
<chr>
1 109
2 109
3 109
4 109
5 63
6 0
# A tibble: 1 x 1
RTC
<chr>
1 " 60"
# A tibble: 1 x 1
RTC
<chr>
1 ""
Now we will try splitting by spaces. We can show the output with the first() and nth() functions of the dplyr package.
[[1]]
[1] "" "" "" "" "" "" "Alabama"
[8] "" "" "" "" "" "" ""
[15] "" "" "" "" "" "" ""
[22] "" "" "" "" "" "1975" ""
[29] "" "" "" "" "" "" ""
[36] "" "" "" "" "" "" ""
[43] "" "" "" "" "" "" ""
[50] "" "" "" "" "" "" ""
[57] "" "" "" "" "" "" ""
[64] "" "" "" "" "" "" ""
[71] "" "" "" "" "" "" ""
[78] "" "" "" "" "" "" ""
[85] "" "" "" "" "" "" ""
[92] "" "" "" "1975"
[[1]]
[1] "" "" "" "" ""
[6] "California" "" "" "" ""
[11] "" "" "" "" ""
[16] "" "" "" "" ""
[21] "" "" "" "N/A" ""
[26] "" "" "" "" ""
[31] "" "" "" "" ""
[36] "" "" "" "" ""
[41] "" "" "" "" ""
[46] "" "" "" "" ""
[51] "" "" "" "" ""
[56] "" "" "" "" ""
[61] "" "" "" "" ""
[66] "" "" "" "" ""
[71] "" "" "" "" ""
[76] "" "" "" "" ""
[81] "" "" "" "" ""
[86] "" "" "" "" ""
[91] "" "" "" "" ""
[96] "0"
Interesting: we can see that there are lots of spaces between the elements of the table and that they vary by line. For example, there are 6 spaces before Alabama and 7 before Alaska.
Overall, that didn’t work quite like we expected.
Recall from the cheatsheet that "\\s" indicates a space. There are also ways to specify how many spaces using curly brackets {}.
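For example, the pattern "\\s{2,5}" matches runs of two to five consecutive white-space characters:
library(stringr)
# One run of 2 spaces and one (greedy) run of 5 spaces -> 2 matches.
str_count("a  b     c", "\\s{2,5}")
#> [1] 2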
The spacing appears to vary quite a bit. We can use the str_count() function of the stringr package to see how often we have white-space runs of at least 5, 10, 15, or 40 spaces.
# how often are there white spaces with more than 5 spaces
p_62 %>%
pull(RTC) %>%
map(str_count, pattern = "\\s{5,}") %>%
unlist()
[1] 3 4 4 4 3 4 2 3 2 4 4 3 4 3 4 4 4 4 4 4 3 2 4 4 4 4 4 4 4 2 3 3 3 3 3 4 4 4
[39] 3 3 2 3 3 4 4 4 3 4 2 3 4 4
# how often are there white spaces with more than 10 spaces
p_62 %>%
pull(RTC) %>%
map(str_count, pattern = "\\s{10,}") %>%
unlist()
[1] 2 3 3 3 2 3 2 2 2 3 3 2 3 2 3 3 3 3 3 3 2 2 3 3 3 3 3 3 3 2 2 3 2 3 3 3 3 3
[39] 3 3 2 3 3 3 3 3 2 3 2 3 3 3
# how often are there white spaces with more than 15 spaces
p_62 %>%
pull(RTC) %>%
map(str_count, pattern = "\\s{15,}") %>%
unlist()
[1] 2 3 3 3 2 3 2 2 1 3 3 2 3 2 3 3 3 3 3 3 2 2 3 3 3 3 3 3 3 2 2 3 2 2 3 3 3 3
[39] 3 2 2 2 3 3 3 3 2 3 2 3 3 3
# how often are there white spaces with more than 40 spaces
p_62 %>%
pull(RTC) %>%
map(str_count, pattern = "\\s{40,}") %>%
unlist()
[1] 1 0 0 0 1 0 1 1 1 0 0 1 0 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0
[39] 0 0 1 0 0 0 0 0 1 0 1 0 0 0
Runs of more than 40 consecutive white spaces are less common. Among the rows printed earlier, they appear in the 1st and 5th rows. If we take a look at those rows, we can see that this occurs when we have a missing value.
Table A1: RTC Adoption Dates
State Effective Date of RTC Law Fraction of Year In Effect Year of Passage RTC Date (Synthetic Controls Analysis)
Alabama 1975 1975
California N/A 0
So we will replace runs of more than 40 consecutive white spaces with N/A. Let's also remove the leading white space that varies in front of the state names, as the District of Columbia row does not have any and this could cause a problem later. We will also replace any run of 2 to 15 consecutive white spaces with "|" so that we can split the data on this symbol. Thus we also put "|" around the N/A value that we use to replace the 40+ space runs.
p_62b <- p_62 %>%
mutate(RTC = str_replace_all(pull(., RTC), "\\s{40,}", "|N/A|")) %>%
mutate(RTC = str_trim(pull(., RTC), side = "left")) %>%
mutate(RTC = str_replace_all(pull(., RTC), "\\s{2,15}", "|"))
head(p_62b)
# A tibble: 6 x 1
RTC
<chr>
1 Alabama||1975|N/A|1975
2 Alaska||10/1/1994||0.252|||1995
3 Arizona| 7/17/1994||0.460|||1995
4 Arkansas| 7/27/1995||0.433|||1996
5 California||N/A|N/A|0
6 Colorado| 5/17/2003||0.627|||2003
Now anytime there is one or more "|" we should have a column break. So now we will split the data by this symbol.
[[1]]
[1] "Alabama" "1975" "N/A" "1975"
[[2]]
[1] "Alaska" "10/1/1994" "0.252" "1995"
[[3]]
[1] "Arizona" " 7/17/1994" "0.460" "1995"
[[4]]
[1] "Arkansas" " 7/27/1995" "0.433" "1996"
[[5]]
[1] "California" "N/A" "N/A" "0"
[[6]]
[1] "Colorado" " 5/17/2003" "0.627" "2003"
Great! Now we want to put our data in tibble format. To do so we need to bind the rows together. We can do so using the base rbind() function. We will use this instead of the bind_rows() function of dplyr because rbind() is less restrictive and allows for columns without names. We will use the base do.call() function, which calls rbind() with the elements of the list p_62b as its arguments (equivalent to rbind(p_62b[[1]], p_62b[[2]], ...)), so each character vector in the list becomes a row while the overall structure is maintained. Then we create a tibble out of the result.
p_62 <- as_tibble(do.call(rbind, p_62b))
colnames(p_62) <- c("STATE",
"E_Date_RTC",
"Frac_Yr_Eff_Yr_Pass",
"RTC_Date_SA")
p_62 <- p_62 %>%
dplyr::select(STATE, RTC_Date_SA) %>%
rename("RTC_LAW_YEAR"= RTC_Date_SA) %>%
mutate(RTC_LAW_YEAR = as.numeric(RTC_LAW_YEAR)) %>%
mutate(RTC_LAW_YEAR = case_when(RTC_LAW_YEAR == 0 ~ Inf,
TRUE ~ RTC_LAW_YEAR))
RTC <- p_62
RTC
# A tibble: 52 x 2
STATE RTC_LAW_YEAR
<chr> <dbl>
1 Alabama 1975
2 Alaska 1995
3 Arizona 1995
4 Arkansas 1996
5 California Inf
6 Colorado 2003
7 Connecticut 1970
8 Delaware Inf
9 District of Columbia Inf
10 Florida 1988
# … with 42 more rows
Note that we recode the value 0 (which the table uses for states that never adopted an RTC law) to Inf, so that a later comparison of the form YEAR >= RTC_LAW_YEAR is always FALSE for these states.
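As an aside, a tidyverse-flavored equivalent of the do.call(rbind, ...) step above could use purrr (a sketch, assuming each element of the split list p_62b has exactly four entries, as printed earlier):
library(purrr)
library(tibble)
p_62_alt <- p_62b %>%
  map(~ set_names(.x, c("STATE", "E_Date_RTC", "Frac_Yr_Eff_Yr_Pass", "RTC_Date_SA"))) %>%
  map_dfr(~ as_tibble(as.list(.x)))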
Here is an alternative approach to the same wrangling (from Michael):
p_62 <- p_62 %>%
apply(1,str_replace_all, "\\s{40,}", "|N/A|") %>%
str_replace_all("\\s{2,15}", "|") %>%
as.data.frame()
p_62 <- sapply(p_62$., str_split, "\\|{1,}")
sapply(p_62, nchar)
p_62 <- lapply(p_62, function(x) x[nchar(x) > 0])
p_62 <- as.data.frame(do.call(rbind, p_62))
rownames(p_62)
rownames(p_62) <- c()
colnames(p_62) <- c("STATE",
"E_Date_RTC",
"Frac_Yr_Eff_Yr_Pass",
"RTC_Date_SA")
sapply(p_62, class)
p_62 <- p_62 %>%
dplyr::select(STATE, RTC_Date_SA) %>%
rename("RTC_LAW_YEAR"=RTC_Date_SA) %>%
mutate(RTC_LAW_YEAR = as.numeric(RTC_LAW_YEAR)) %>%
mutate(RTC_LAW_YEAR = case_when(RTC_LAW_YEAR == 0 ~ Inf,
TRUE ~ RTC_LAW_YEAR))
sapply(p_62, class)
head(p_62)
Now we will join the data from the different datasets together to create a tibble of data for an analysis that will be similar to the data used by Donohue et al. and Lott and Mustard.
First we need to check that our data is indeed ready to be joined. We need to make sure that the column names are the same for each dataset that we intend to combine together.
We will use the compare_df_cols() and compare_df_cols_same() functions of the janitor package, to ensure that the column names are the same and that the column values are the same type so that the tibbles can be joined by row.
If they can be joined by row, then compare_df_cols_same() returns TRUE, while compare_df_cols() provides a description of the columns.
library(janitor)
data_list <- list(dem_DONOHUE,
dem_LOTT,
population_data,
ue_rate_data,
poverty_rate_data,
crime_data,
ps_data) #police staffing
janitor::compare_df_cols_same(data_list)
[1] TRUE
column_name data_list_1 data_list_2 data_list_3 data_list_4 data_list_5
1 STATE character character character character character
2 VALUE numeric numeric numeric numeric numeric
3 VARIABLE character character character character character
4 YEAR numeric numeric numeric numeric numeric
data_list_6 data_list_7
1 character character
2 numeric numeric
3 character character
4 numeric numeric
checkstate <- function(x) { x %>% distinct(STATE) %>% tally() %>% pull(n) }
map(data_list, checkstate)
[[1]]
[1] 51
[[2]]
[1] 51
[[3]]
[1] 51
[[4]]
[1] 51
[[5]]
[1] 51
[[6]]
[1] 51
[[7]]
[1] 51
checkyear <- function(x) { x %>% distinct(YEAR) %>% tally() %>% pull(n) }
map(data_list, checkyear)
[[1]]
[1] 34
[[2]]
[1] 34
[[3]]
[1] 34
[[4]]
[1] 44
[[5]]
[1] 39
[[6]]
[1] 38
[[7]]
[1] 38
Note that bind_rows() will produce NA values for the years that a given dataset does not cover; we will remove these incomplete rows later with drop_na().
We will now bind the demographic data that we made for the Donohue-like analysis, called dem_DONOHUE, together with all the other datasets that we have wrangled. This is possible because we have the same column names for each dataset. We will also use the pivot_wider() function of the tidyr package to change the shape of the data. This will give the data more columns: each unique value in the column called VARIABLE will be used to make a new column, and the values for each will come from the column called VALUE.
DONOHUE_DF <- bind_rows(dem_DONOHUE,
ue_rate_data,
poverty_rate_data,
crime_data,
population_data,
ps_data)
head(DONOHUE_DF)
# A tibble: 6 x 4
YEAR STATE VARIABLE VALUE
<dbl> <chr> <chr> <dbl>
1 1977 Alabama Black_Male_15_to_19_years 1.55
2 1977 Alabama Black_Male_20_to_39_years 3.04
3 1977 Alabama Other_Male_15_to_19_years 0.0178
4 1977 Alabama Other_Male_20_to_39_years 0.0642
5 1977 Alabama White_Male_15_to_19_years 3.58
6 1977 Alabama White_Male_20_to_39_years 11.1
DONOHUE_DF %<>%
pivot_wider(names_from = "VARIABLE",
values_from = "VALUE")
DONOHUE_DF %>%
slice_sample(n = 10) %>%
glimpse()
Rows: 10
Columns: 13
$ YEAR <dbl> 1995, 2003, 2015, 2000, 1997, 2012, 1988, 1…
$ STATE <chr> "California", "Massachusetts", "Oregon", "O…
$ Black_Male_15_to_19_years <dbl> 0.31138464, 0.31299021, NA, 0.37549593, 0.6…
$ Black_Male_20_to_39_years <dbl> 1.36817330, 1.05512361, NA, 1.21101852, 2.1…
$ Other_Male_15_to_19_years <dbl> 0.45350592, 0.25250036, NA, 0.68999657, 0.1…
$ Other_Male_20_to_39_years <dbl> 2.1290345, 1.1601751, NA, 1.9453069, 0.5596…
$ White_Male_15_to_19_years <dbl> 2.691623, 2.930434, NA, 2.974787, 2.889778,…
$ White_Male_20_to_39_years <dbl> 13.58846, 11.63306, NA, 10.89523, 12.14260,…
$ Unemployment_rate <dbl> 7.9, 5.7, 5.6, 3.0, 4.8, 9.0, 6.0, 5.1, 9.3…
$ Poverty_rate <dbl> 16.7, 10.3, 11.9, 14.9, 11.2, 22.0, 12.5, 1…
$ Viol_crime_count <dbl> 305154, 30377, NA, 17177, 102476, 7769, 234…
$ Population <dbl> 31493525, 6422565, NA, 3454365, 12011509, N…
$ police_per_100k_lag <dbl> 293.3968, 316.2755, NA, 303.9922, 329.2592,…
We will also add the right-to-carry law data using the left_join() function of the dplyr package, which will place the DONOHUE_DF data on the left of the RTC data, with values matched by STATE. Then we will create a new variable called RTC_LAW using the mutate() and case_when() functions of the dplyr package; it will have the value TRUE if the current year is equal to or greater than the year that a more permissive RTC law was adopted, and FALSE otherwise.
# A tibble: 6 x 2
STATE RTC_LAW_YEAR
<chr> <dbl>
1 Alabama 1975
2 Alaska 1995
3 Arizona 1995
4 Arkansas 1996
5 California Inf
6 Colorado 2003
DONOHUE_DF %<>%
left_join(RTC , by = c("STATE")) %>%
mutate(RTC_LAW = case_when(YEAR >= RTC_LAW_YEAR ~ TRUE,
TRUE ~ FALSE))
DONOHUE_DF %>%
slice_sample(n = 10) %>%
glimpse()
Rows: 10
Columns: 15
$ YEAR <dbl> 1988, 2015, 1994, 1986, 1984, 2000, 2004, 2…
$ STATE <chr> "Iowa", "Maryland", "Michigan", "Alabama", …
$ Black_Male_15_to_19_years <dbl> 0.09059483, NA, 0.60273478, 1.26148927, 0.4…
$ Black_Male_20_to_39_years <dbl> 0.2938914, NA, 2.1038176, 3.6530402, 1.5079…
$ Other_Male_15_to_19_years <dbl> 0.06148022, NA, 0.08476202, 0.03605114, 0.0…
$ Other_Male_20_to_39_years <dbl> 0.2328085, NA, 0.3460699, 0.1415490, 0.1527…
$ White_Male_15_to_19_years <dbl> 3.664647, NA, 2.918864, 2.887824, 3.581117,…
$ White_Male_20_to_39_years <dbl> 14.906461, NA, 12.874229, 12.087804, 14.286…
$ Unemployment_rate <dbl> 4.5, 5.1, 6.0, 9.7, 9.5, 4.9, 5.1, 4.7, 7.9…
$ Poverty_rate <dbl> 9.4, 9.6, 14.1, 23.8, 13.5, 17.5, 9.3, 14.8…
$ Viol_crime_count <dbl> 7279, NA, 72751, 22616, 41430, 13786, 29489…
$ Population <dbl> 2768370, NA, 9584481, 3991552, 10737767, 18…
$ police_per_100k_lag <dbl> 205.7167, NA, 272.4300, 260.5753, 216.5348,…
$ RTC_LAW_YEAR <dbl> 2011, Inf, 2001, 1975, 2004, 2004, Inf, 199…
$ RTC_LAW <lgl> FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FA…
Since we have differing numbers of years for each dataset, we can use the drop_na() function of the tidyr package to remove years that have incomplete data. Thus any row with NA values will be removed (a sketch of this call follows the examples below).
For example, we can see that for 1977, although we have most of the data, we do not have the poverty rate.
Rows: 6
Columns: 15
$ YEAR <dbl> 1977, 1977, 1977, 1977, 1977, 1977
$ STATE <chr> "Alabama", "Alaska", "Arizona", "Arkansas",…
$ Black_Male_15_to_19_years <dbl> 1.5457212, 0.1631338, 0.1804065, 1.0106946,…
$ Black_Male_20_to_39_years <dbl> 3.0379602, 0.9684809, 0.4796284, 1.8185525,…
$ Other_Male_15_to_19_years <dbl> 0.01781857, 1.11902724, 0.40213472, 0.03058…
$ Other_Male_20_to_39_years <dbl> 0.06421558, 2.72594532, 0.89144464, 0.09260…
$ White_Male_15_to_19_years <dbl> 3.578069, 3.828357, 4.391965, 3.879766, 4.1…
$ White_Male_20_to_39_years <dbl> 11.08537, 17.91501, 14.07430, 11.70744, 14.…
$ Unemployment_rate <dbl> 7.3, 9.9, 8.2, 6.5, 8.3, 6.4
$ Poverty_rate <dbl> NA, NA, NA, NA, NA, NA
$ Viol_crime_count <dbl> 15293, 1804, 11347, 6924, 154582, 13407
$ Population <dbl> 3782571, 397220, 2427296, 2207195, 22350332…
$ police_per_100k_lag <dbl> 195.1054, 136.9518, 264.2447, 151.5045, 293…
$ RTC_LAW_YEAR <dbl> 1975, 1995, 1995, 1996, Inf, 2003
$ RTC_LAW <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE
Another example: in 2018 we only have information about unemployment rates, poverty rates, and RTC laws.
Rows: 6
Columns: 15
$ YEAR <dbl> 2018, 2018, 2018, 2018, 2018, 2018
$ STATE <chr> "Alabama", "Alaska", "Arizona", "Arkansas",…
$ Black_Male_15_to_19_years <dbl> NA, NA, NA, NA, NA, NA
$ Black_Male_20_to_39_years <dbl> NA, NA, NA, NA, NA, NA
$ Other_Male_15_to_19_years <dbl> NA, NA, NA, NA, NA, NA
$ Other_Male_20_to_39_years <dbl> NA, NA, NA, NA, NA, NA
$ White_Male_15_to_19_years <dbl> NA, NA, NA, NA, NA, NA
$ White_Male_20_to_39_years <dbl> NA, NA, NA, NA, NA, NA
$ Unemployment_rate <dbl> 3.9, 6.5, 4.7, 3.6, 4.3, 3.2
$ Poverty_rate <dbl> 16.0, 13.1, 12.8, 15.9, 11.9, 9.1
$ Viol_crime_count <dbl> NA, NA, NA, NA, NA, NA
$ Population <dbl> NA, NA, NA, NA, NA, NA
$ police_per_100k_lag <dbl> NA, NA, NA, NA, NA, NA
$ RTC_LAW_YEAR <dbl> 1975, 1995, 1995, 1996, Inf, 2003
$ RTC_LAW <lgl> TRUE, TRUE, TRUE, TRUE, FALSE, TRUE
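Here is a minimal sketch of that dropping step for the Donohue-like data (the LOTT_DF pipeline below applies the same drop_na() call explicitly):
library(tidyr)
# Remove every row that contains at least one NA value.
DONOHUE_DF <- DONOHUE_DF %>%
  drop_na()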
Rows: 6
Columns: 15
$ YEAR <dbl> 1980, 1980, 1980, 1980, 1980, 1980
$ STATE <chr> "Alabama", "Alaska", "Arizona", "Arkansas",…
$ Black_Male_15_to_19_years <dbl> 1.4567383, 0.1670456, 0.1747544, 0.9545139,…
$ Black_Male_20_to_39_years <dbl> 3.3613348, 0.9933775, 0.5267121, 1.9738213,…
$ Other_Male_15_to_19_years <dbl> 0.02128385, 1.12978156, 0.41504620, 0.03849…
$ Other_Male_20_to_39_years <dbl> 0.08608419, 2.96332905, 0.98492602, 0.12425…
$ White_Male_15_to_19_years <dbl> 3.398210, 3.627805, 4.091577, 3.740199, 3.8…
$ White_Male_20_to_39_years <dbl> 11.57164, 18.28852, 14.69238, 12.12513, 14.…
$ Unemployment_rate <dbl> 8.9, 9.6, 6.6, 7.6, 6.8, 5.8
$ Poverty_rate <dbl> 21.2, 9.6, 12.8, 21.5, 11.0, 8.6
$ Viol_crime_count <dbl> 17320, 1919, 17673, 7656, 210290, 15215
$ Population <dbl> 3899671, 404680, 2735840, 2288809, 23792840…
$ police_per_100k_lag <dbl> 201.3247, 194.7218, 262.6616, 152.0005, 243…
$ RTC_LAW_YEAR <dbl> 1975, 1995, 1995, 1996, Inf, 2003
$ RTC_LAW <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE
Rows: 6
Columns: 15
$ YEAR <dbl> 2010, 2010, 2010, 2010, 2010, 2010
$ STATE <chr> "Vermont", "Virginia", "Washington", "West …
$ Black_Male_15_to_19_years <dbl> 0.06725669, 0.83162848, 0.15964128, 0.16343…
$ Black_Male_20_to_39_years <dbl> 0.2056042, 2.7300368, 0.6653574, 0.6164599,…
$ Other_Male_15_to_19_years <dbl> 0.1702984, 0.3274050, 0.5979246, 0.1050716,…
$ Other_Male_20_to_39_years <dbl> 0.4754297, 1.3483385, 2.1186016, 0.3126259,…
$ White_Male_15_to_19_years <dbl> 3.531216, 2.338080, 2.755299, 3.050422, 2.9…
$ White_Male_20_to_39_years <dbl> 11.363026, 9.776093, 11.195603, 11.548874, …
$ Unemployment_rate <dbl> 6.1, 7.1, 10.0, 8.7, 8.7, 6.4
$ Poverty_rate <dbl> 10.8, 10.7, 11.6, 16.8, 10.1, 9.6
$ Viol_crime_count <dbl> 820, 17184, 21138, 5586, 14167, 1117
$ Population <dbl> 625960, 8024617, 6744496, 1853973, 5691047,…
$ police_per_100k_lag <dbl> 264.2341, 302.9677, 265.1495, 272.0644, 344…
$ RTC_LAW_YEAR <dbl> 1970, 1995, 1961, 1990, 2012, 1995
$ RTC_LAW <lgl> TRUE, TRUE, TRUE, TRUE, FALSE, TRUE
Now we have complete data and the data spans from 1980 to 2010.
[1] 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994
[16] 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
[31] 2010
If we include states that adopted an RTC law before our time span of data (say, in 1975), then we only have information about crime rates and the other variables of interest after the law was adopted but not before, so including these states doesn't really make sense. Thus, we will drop the data for these states. We can use the setdiff() function of the dplyr package to see which states are in population_data, which contains all of the original 51 "states" (recall this includes the District of Columbia), but are not in DONOHUE_DF. The order matters here: if we did it the other way around, with population_data listed second, then setdiff() would return the states in DONOHUE_DF that are not in population_data, and as there are none, this would return nothing.
baseline_year <- min(DONOHUE_DF$YEAR)
censoring_year <- max(DONOHUE_DF$YEAR)
DONOHUE_DF %<>%
mutate(TIME_0 = baseline_year,
TIME_INF = censoring_year) %>%
filter(RTC_LAW_YEAR > TIME_0)
# DONOHUE_DF %<>%
# mutate(STATE = as.factor(STATE))
#
# DONOHUE_DF %>%
# pull(STATE) %>%
# levels()
setdiff(distinct(population_data, STATE),
distinct(DONOHUE_DF, STATE))
# A tibble: 6 x 1
STATE
<chr>
1 Alabama
2 Connecticut
3 Indiana
4 New Hampshire
5 Vermont
6 Washington
We will also calculate a violent crime rate relative to the population in that state at that time, now that we have data for both crime count and population. We will also calculate the log of this rate and of the population.
DONOHUE_DF %<>%
mutate(Viol_crime_rate_1k = (Viol_crime_count*1000)/Population,
Viol_crime_rate_1k_log = log(Viol_crime_rate_1k),
Population_log = log(Population))
Here is an alternative approach (from Michael):
DONOHUE_DF %>%
group_by(YEAR) %>%
tally() %>%
filter(n != 51) %>%
print(n = dim(.)[1])
summary(as.factor(DONOHUE_DF$STATE))
max(DONOHUE_DF$YEAR) - min(DONOHUE_DF$YEAR) + 1
DONOHUE_DF <- DONOHUE_DF %>%
mutate(STATE = fct_collapse(STATE, "District of Columbia"=c("District of Columbia","D.C.")))
summary(as.factor(DONOHUE_DF$STATE))
length(levels(DONOHUE_DF$STATE))
DONOHUE_DF <- DONOHUE_DF %>%
group_by(STATE, YEAR) %>%
summarise_all(~na.omit(unique(.))) %>%
ungroup() # This identifies unique observations, coalesces rows according to the grouping variable(s), and drops units that have incomplete data, returning a data frame with the most complete information.
summary(as.factor(DONOHUE_DF$STATE))
baseline_year <- min(DONOHUE_DF$YEAR)
censoring_year <- max(DONOHUE_DF$YEAR)
# Need to fix this to ensure severe bias is not introduced by prevalent "cases"
DONOHUE_DF <- DONOHUE_DF %>%
mutate(TIME_0 = baseline_year,
TIME_INF = censoring_year) %>%
filter(RTC_LAW_YEAR > TIME_0)
DONOHUE_DF <- DONOHUE_DF %>%
mutate(Viol_crime_rate_1k = (Viol_crime_count*1000)/Population,
Viol_crime_rate_1k_log = log(Viol_crime_rate_1k),
Population_log = log(Population))
summary(droplevels(as.factor(DONOHUE_DF$STATE)))
length(summary(droplevels(as.factor(DONOHUE_DF$STATE))))
We will now bind the demographic data that we made for the Lott-like analysis, called dem_LOTT, together with all the other datasets that we have wrangled, just as we did for the Donohue-like analysis. Again, this is possible because we have the same column names for each dataset.
LOTT_DF <- bind_rows(dem_LOTT,
ue_rate_data,
poverty_rate_data,
crime_data,
population_data,
ps_data) %>%
pivot_wider(names_from = "VARIABLE",
values_from = "VALUE") %>%
left_join(RTC , by = c("STATE")) %>%
mutate(RTC_LAW = case_when(YEAR >= RTC_LAW_YEAR ~ TRUE,
TRUE ~ FALSE)) %>%
drop_na()
baseline_year <- min(LOTT_DF$YEAR)
censoring_year <- max(LOTT_DF$YEAR)
LOTT_DF %<>%
mutate(TIME_0 = baseline_year,
TIME_INF = censoring_year) %>%
filter(RTC_LAW_YEAR > TIME_0)
setdiff(distinct(population_data, STATE),
distinct(LOTT_DF, STATE))# A tibble: 6 x 1
STATE
<chr>
1 Alabama
2 Connecticut
3 Indiana
4 New Hampshire
5 Vermont
6 Washington
LOTT_DF %<>%
mutate(Viol_crime_rate_1k = (Viol_crime_count*1000)/Population,
Viol_crime_rate_1k_log = log(Viol_crime_rate_1k),
Population_log = log(Population))
Let's see how the two datasets compare:
We will check the dimensions of each using the base dim() function
[1] 1395 50
[1] 1395 20
As expected, LOTT_DF has 30 more columns, due to the 30 additional demographic variables. We can check those now as well.
[1] "YEAR" "STATE"
[3] "Black_Female_10_to_19_years" "Black_Female_20_to_29_years"
[5] "Black_Female_30_to_39_years" "Black_Female_40_to_49_years"
[7] "Black_Female_50_to_64_years" "Black_Female_65_years_and_over"
[9] "Black_Male_10_to_19_years" "Black_Male_20_to_29_years"
[11] "Black_Male_30_to_39_years" "Black_Male_40_to_49_years"
[13] "Black_Male_50_to_64_years" "Black_Male_65_years_and_over"
[15] "Other_Female_10_to_19_years" "Other_Female_20_to_29_years"
[17] "Other_Female_30_to_39_years" "Other_Female_40_to_49_years"
[19] "Other_Female_50_to_64_years" "Other_Female_65_years_and_over"
[21] "Other_Male_10_to_19_years" "Other_Male_20_to_29_years"
[23] "Other_Male_30_to_39_years" "Other_Male_40_to_49_years"
[25] "Other_Male_50_to_64_years" "Other_Male_65_years_and_over"
[27] "White_Female_10_to_19_years" "White_Female_20_to_29_years"
[29] "White_Female_30_to_39_years" "White_Female_40_to_49_years"
[31] "White_Female_50_to_64_years" "White_Female_65_years_and_over"
[33] "White_Male_10_to_19_years" "White_Male_20_to_29_years"
[35] "White_Male_30_to_39_years" "White_Male_40_to_49_years"
[37] "White_Male_50_to_64_years" "White_Male_65_years_and_over"
[39] "Unemployment_rate" "Poverty_rate"
[41] "Viol_crime_count" "Population"
[43] "police_per_100k_lag" "RTC_LAW_YEAR"
[45] "RTC_LAW" "TIME_0"
[47] "TIME_INF" "Viol_crime_rate_1k"
[49] "Viol_crime_rate_1k_log" "Population_log"
[1] "YEAR" "STATE"
[3] "Black_Male_15_to_19_years" "Black_Male_20_to_39_years"
[5] "Other_Male_15_to_19_years" "Other_Male_20_to_39_years"
[7] "White_Male_15_to_19_years" "White_Male_20_to_39_years"
[9] "Unemployment_rate" "Poverty_rate"
[11] "Viol_crime_count" "Population"
[13] "police_per_100k_lag" "RTC_LAW_YEAR"
[15] "RTC_LAW" "TIME_0"
[17] "TIME_INF" "Viol_crime_rate_1k"
[19] "Viol_crime_rate_1k_log" "Population_log"
Lastly, we will check that the YEAR values are the same, using the setequal() function of the dplyr package.
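A sketch of that check on the year columns:
# setequal() returns TRUE when the two vectors contain exactly the same set of values.
setequal(DONOHUE_DF$YEAR, LOTT_DF$YEAR)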
[1] TRUE
Looks as expected! We have the same years in both datasets.
Here is an alternative approach (from Michael):
LOTT_DF %>%
group_by(YEAR) %>%
tally() %>%
filter(n != 51) %>%
print(n=dim(.)[1])
summary(as.factor(LOTT_DF$STATE))
max(LOTT_DF$YEAR) - min(LOTT_DF$YEAR) + 1
LOTT_DF <- LOTT_DF %>%
mutate(STATE = fct_collapse(STATE, "District of Columbia" = c("District of Columbia","D.C.")))
summary(as.factor(LOTT_DF$STATE))
length(levels(LOTT_DF$STATE))
LOTT_DF <- LOTT_DF %>%
group_by(STATE, YEAR) %>%
summarise_all(~na.omit(unique(.))) %>%
ungroup() # This identifies unique observations, coalesces rows according to the grouping variable(s), and drops units that have incomplete data, returning a data frame with the most complete information.
summary(as.factor(LOTT_DF$STATE))
baseline_year <- min(LOTT_DF$YEAR)
censoring_year <- max(LOTT_DF$YEAR)
# Need to fix this to ensure severe bias is not introduced by prevalent "cases"
LOTT_DF <- LOTT_DF %>%
mutate(TIME_0 = baseline_year,
TIME_INF = censoring_year) %>%
filter(RTC_LAW_YEAR > TIME_0)
LOTT_DF <- LOTT_DF %>%
mutate(Viol_crime_rate_1k = (Viol_crime_count*1000)/Population,
Viol_crime_rate_1k_log = log(Viol_crime_rate_1k),
Population_log = log(Population))
summary(droplevels(as.factor(LOTT_DF$STATE)))
length(summary(droplevels(as.factor(LOTT_DF$STATE))))
Let's do some quick visualizations of the crime data across different states and over time.
DONOHUE_DF %>%
mutate(Viol_crime_rate_100k_log = log((Viol_crime_count*100000)/Population)) %>%
ggplot(aes(x = YEAR, y = Viol_crime_rate_100k_log, color = STATE)) +
geom_point(size = 0.5) +
geom_line(aes(group=STATE),
size = 0.5,
show.legend = FALSE) +
geom_text_repel(data = DONOHUE_DF %>%
mutate(Viol_crime_rate_100k_log = log((Viol_crime_count*100000)/Population)) %>%
filter(YEAR == last(YEAR)),
aes(label = STATE,
x = YEAR,
y = Viol_crime_rate_100k_log),
size = 3,
alpha = 1,
nudge_x = 10,
direction = "y",
hjust = 1,
vjust = 1,
segment.size = 0.25,
segment.alpha = 0.25,
force = 1,
max.iter = 9999) +
guides(color = FALSE) +
scale_x_continuous(breaks = seq(1980, 2015, by = 1),
limits = c(1980, 2015),
labels = c(seq(1980, 2010, by = 1), rep("", 5))) +
scale_y_continuous(breaks = seq(3.5, 8.5, by = 0.5),
limits = c(3.5, 8.5)) +
labs(title = "States have different levels of crime",
x = "Year",
y = "ln(violent crimes per 100,000 people)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90))
DONOHUE_DF %>%
group_by(YEAR) %>%
summarise(Viol_crime_count = sum(Viol_crime_count),
Population = sum(Population),
.groups = "drop") %>%
mutate(Viol_crime_rate_100k_log = log((Viol_crime_count*100000)/Population)) %>%
ggplot(aes(x = YEAR, y = Viol_crime_rate_100k_log)) +
geom_line() +
scale_x_continuous(breaks = seq(1980, 2010, by = 1),
limits = c(1980, 2010),
labels = c(seq(1980, 2010, by = 1))) +
scale_y_continuous(breaks = seq(5.75, 6.75, by = 0.25),
limits = c(5.75, 6.75)) +
labs(title = "Crime rates fluctuate over time",
x = "Year",
y = "ln(violent crimes per 100,000 people)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90))
We can also use the vis_miss() function (from the naniar/visdat family of packages) to confirm that there are no missing values.
Looks like no missing data! The same holds for LOTT_DF.
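A sketch of those calls (the rendered plots are not reproduced here); note that vis_miss() itself is exported by the companion visdat package:
library(naniar)
library(visdat)
vis_miss(DONOHUE_DF)
vis_miss(LOTT_DF)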
We can use the skim() function of the skimr package to get a better sense of the data. This shows missingness as well as means, standard deviations, and the spread of our data. Also notice that there is a small histogram of each numeric variable.
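A sketch of the calls that produce the summaries below:
library(skimr)
skim(DONOHUE_DF)
skim(LOTT_DF)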
| Name | DONOHUE_DF |
| Number of rows | 1395 |
| Number of columns | 20 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| logical | 1 |
| numeric | 18 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| STATE | 0 | 1 | 4 | 20 | 0 | 45 | 0 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| RTC_LAW | 0 | 1 | 0.37 | FAL: 883, TRU: 512 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| YEAR | 0 | 1 | 1995.00 | 8.95 | 1980.00 | 1987.00 | 1995.00 | 2003.00 | 2010.00 | ▇▇▇▇▇ |
| Black_Male_15_to_19_years | 0 | 1 | 0.52 | 0.51 | 0.02 | 0.13 | 0.35 | 0.73 | 3.46 | ▇▂▁▁▁ |
| Black_Male_20_to_39_years | 0 | 1 | 1.73 | 1.76 | 0.07 | 0.50 | 1.17 | 2.30 | 11.33 | ▇▂▁▁▁ |
| Other_Male_15_to_19_years | 0 | 1 | 0.26 | 0.40 | 0.01 | 0.08 | 0.14 | 0.28 | 2.90 | ▇▁▁▁▁ |
| Other_Male_20_to_39_years | 0 | 1 | 0.92 | 1.40 | 0.07 | 0.31 | 0.55 | 0.99 | 9.90 | ▇▁▁▁▁ |
| White_Male_15_to_19_years | 0 | 1 | 3.09 | 0.72 | 0.55 | 2.69 | 3.14 | 3.54 | 4.99 | ▁▁▇▇▁ |
| White_Male_20_to_39_years | 0 | 1 | 12.55 | 2.29 | 4.41 | 11.18 | 12.65 | 14.15 | 18.44 | ▁▂▇▇▂ |
| Unemployment_rate | 0 | 1 | 6.03 | 2.10 | 2.30 | 4.50 | 5.60 | 7.20 | 17.80 | ▇▇▂▁▁ |
| Poverty_rate | 0 | 1 | 13.34 | 3.84 | 5.70 | 10.40 | 12.70 | 15.50 | 27.20 | ▃▇▅▁▁ |
| Viol_crime_count | 0 | 1 | 31760.77 | 46494.46 | 322.00 | 5107.50 | 14412.00 | 38782.50 | 345624.00 | ▇▁▁▁▁ |
| Population | 0 | 1 | 5446810.81 | 6070687.33 | 404680.00 | 1363737.50 | 3504892.00 | 6411701.00 | 37349363.00 | ▇▂▁▁▁ |
| police_per_100k_lag | 0 | 1 | 316.57 | 115.59 | 83.76 | 248.53 | 301.00 | 358.31 | 1021.14 | ▅▇▁▁▁ |
| RTC_LAW_YEAR | 0 | 1 | Inf | NaN | 1985.00 | 1995.00 | 1997.00 | 2011.00 | Inf | ▇▇▂▅▂ |
| TIME_0 | 0 | 1 | 1980.00 | 0.00 | 1980.00 | 1980.00 | 1980.00 | 1980.00 | 1980.00 | ▁▁▇▁▁ |
| TIME_INF | 0 | 1 | 2010.00 | 0.00 | 2010.00 | 2010.00 | 2010.00 | 2010.00 | 2010.00 | ▁▁▇▁▁ |
| Viol_crime_rate_1k | 0 | 1 | 5.05 | 3.19 | 0.48 | 2.85 | 4.54 | 6.44 | 29.30 | ▇▃▁▁▁ |
| Viol_crime_rate_1k_log | 0 | 1 | 1.45 | 0.60 | -0.74 | 1.05 | 1.51 | 1.86 | 3.38 | ▁▂▇▅▁ |
| Population_log | 0 | 1 | 14.99 | 1.05 | 12.91 | 14.13 | 15.07 | 15.67 | 17.44 | ▃▅▇▅▂ |
| Name | LOTT_DF |
| Number of rows | 1395 |
| Number of columns | 50 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| logical | 1 |
| numeric | 48 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| STATE | 0 | 1 | 4 | 20 | 0 | 45 | 0 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| RTC_LAW | 0 | 1 | 0.37 | FAL: 883, TRU: 512 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| YEAR | 0 | 1 | 1995.00 | 8.95 | 1980.00 | 1987.00 | 1995.00 | 2003.00 | 2010.00 | ▇▇▇▇▇ |
| Black_Female_10_to_19_years | 0 | 1 | 1.00 | 1.02 | 0.02 | 0.23 | 0.63 | 1.44 | 6.53 | ▇▂▁▁▁ |
| Black_Female_20_to_29_years | 0 | 1 | 0.99 | 1.09 | 0.02 | 0.23 | 0.60 | 1.36 | 7.73 | ▇▂▁▁▁ |
| Black_Female_30_to_39_years | 0 | 1 | 0.91 | 1.00 | 0.01 | 0.19 | 0.57 | 1.27 | 6.11 | ▇▂▁▁▁ |
| Black_Female_40_to_49_years | 0 | 1 | 0.75 | 0.86 | 0.01 | 0.13 | 0.48 | 1.09 | 5.45 | ▇▂▁▁▁ |
| Black_Female_50_to_64_years | 0 | 1 | 0.76 | 0.96 | 0.00 | 0.12 | 0.44 | 1.06 | 6.10 | ▇▂▁▁▁ |
| Black_Female_65_years_and_over | 0 | 1 | 0.60 | 0.85 | 0.00 | 0.07 | 0.34 | 0.81 | 6.12 | ▇▁▁▁▁ |
| Black_Male_10_to_19_years | 0 | 1 | 1.02 | 1.01 | 0.03 | 0.26 | 0.67 | 1.46 | 6.32 | ▇▂▁▁▁ |
| Black_Male_20_to_29_years | 0 | 1 | 0.93 | 0.93 | 0.04 | 0.29 | 0.65 | 1.24 | 6.57 | ▇▂▁▁▁ |
| Black_Male_30_to_39_years | 0 | 1 | 0.81 | 0.84 | 0.02 | 0.23 | 0.54 | 1.08 | 5.37 | ▇▂▁▁▁ |
| Black_Male_40_to_49_years | 0 | 1 | 0.65 | 0.72 | 0.01 | 0.15 | 0.43 | 0.92 | 4.45 | ▇▂▁▁▁ |
| Black_Male_50_to_64_years | 0 | 1 | 0.62 | 0.76 | 0.00 | 0.13 | 0.38 | 0.86 | 4.79 | ▇▂▁▁▁ |
| Black_Male_65_years_and_over | 0 | 1 | 0.38 | 0.51 | 0.00 | 0.05 | 0.24 | 0.51 | 3.56 | ▇▁▁▁▁ |
| Other_Female_10_to_19_years | 0 | 1 | 0.51 | 0.77 | 0.03 | 0.15 | 0.27 | 0.55 | 5.33 | ▇▁▁▁▁ |
| Other_Female_20_to_29_years | 0 | 1 | 0.49 | 0.70 | 0.04 | 0.17 | 0.30 | 0.55 | 5.55 | ▇▁▁▁▁ |
| Other_Female_30_to_39_years | 0 | 1 | 0.47 | 0.74 | 0.04 | 0.16 | 0.28 | 0.51 | 5.36 | ▇▁▁▁▁ |
| Other_Female_40_to_49_years | 0 | 1 | 0.38 | 0.69 | 0.02 | 0.11 | 0.21 | 0.38 | 5.46 | ▇▁▁▁▁ |
| Other_Female_50_to_64_years | 0 | 1 | 0.38 | 0.83 | 0.02 | 0.09 | 0.18 | 0.35 | 7.10 | ▇▁▁▁▁ |
| Other_Female_65_years_and_over | 0 | 1 | 0.24 | 0.72 | 0.01 | 0.05 | 0.09 | 0.18 | 6.20 | ▇▁▁▁▁ |
| Other_Male_10_to_19_years | 0 | 1 | 0.52 | 0.80 | 0.03 | 0.15 | 0.28 | 0.56 | 5.58 | ▇▁▁▁▁ |
| Other_Male_20_to_29_years | 0 | 1 | 0.48 | 0.70 | 0.03 | 0.17 | 0.29 | 0.53 | 5.33 | ▇▁▁▁▁ |
| Other_Male_30_to_39_years | 0 | 1 | 0.44 | 0.71 | 0.03 | 0.14 | 0.26 | 0.47 | 5.06 | ▇▁▁▁▁ |
| Other_Male_40_to_49_years | 0 | 1 | 0.35 | 0.65 | 0.02 | 0.09 | 0.19 | 0.34 | 5.13 | ▇▁▁▁▁ |
| Other_Male_50_to_64_years | 0 | 1 | 0.33 | 0.73 | 0.01 | 0.08 | 0.15 | 0.29 | 6.50 | ▇▁▁▁▁ |
| Other_Male_65_years_and_over | 0 | 1 | 0.19 | 0.58 | 0.01 | 0.03 | 0.07 | 0.14 | 4.51 | ▇▁▁▁▁ |
| White_Female_10_to_19_years | 0 | 1 | 5.72 | 1.38 | 0.94 | 5.01 | 5.82 | 6.61 | 9.45 | ▁▁▇▆▁ |
| White_Female_20_to_29_years | 0 | 1 | 6.09 | 1.36 | 1.59 | 5.23 | 5.92 | 6.93 | 9.66 | ▁▂▇▅▂ |
| White_Female_30_to_39_years | 0 | 1 | 6.17 | 1.22 | 1.53 | 5.46 | 6.28 | 7.02 | 8.95 | ▁▁▅▇▂ |
| White_Female_40_to_49_years | 0 | 1 | 5.58 | 1.23 | 1.20 | 4.85 | 5.68 | 6.41 | 8.33 | ▁▁▇▇▃ |
| White_Female_50_to_64_years | 0 | 1 | 6.56 | 1.45 | 1.72 | 6.01 | 6.57 | 7.34 | 11.40 | ▁▂▇▂▁ |
| White_Female_65_years_and_over | 0 | 1 | 6.38 | 1.70 | 1.05 | 5.37 | 6.62 | 7.51 | 9.90 | ▁▁▆▇▂ |
| White_Male_10_to_19_years | 0 | 1 | 6.04 | 1.43 | 1.02 | 5.30 | 6.14 | 6.95 | 9.74 | ▁▁▇▇▁ |
| White_Male_20_to_29_years | 0 | 1 | 6.28 | 1.33 | 2.41 | 5.43 | 6.12 | 7.14 | 10.84 | ▁▆▇▃▁ |
| White_Male_30_to_39_years | 0 | 1 | 6.27 | 1.19 | 1.93 | 5.59 | 6.33 | 7.06 | 9.67 | ▁▂▇▆▁ |
| White_Male_40_to_49_years | 0 | 1 | 5.59 | 1.22 | 1.35 | 4.79 | 5.67 | 6.43 | 8.39 | ▁▁▇▇▂ |
| White_Male_50_to_64_years | 0 | 1 | 6.26 | 1.40 | 1.78 | 5.64 | 6.19 | 6.94 | 10.93 | ▁▂▇▂▁ |
| White_Male_65_years_and_over | 0 | 1 | 4.56 | 1.19 | 1.02 | 3.80 | 4.78 | 5.34 | 7.51 | ▁▂▇▇▁ |
| Unemployment_rate | 0 | 1 | 6.03 | 2.10 | 2.30 | 4.50 | 5.60 | 7.20 | 17.80 | ▇▇▂▁▁ |
| Poverty_rate | 0 | 1 | 13.34 | 3.84 | 5.70 | 10.40 | 12.70 | 15.50 | 27.20 | ▃▇▅▁▁ |
| Viol_crime_count | 0 | 1 | 31760.77 | 46494.46 | 322.00 | 5107.50 | 14412.00 | 38782.50 | 345624.00 | ▇▁▁▁▁ |
| Population | 0 | 1 | 5446810.81 | 6070687.33 | 404680.00 | 1363737.50 | 3504892.00 | 6411701.00 | 37349363.00 | ▇▂▁▁▁ |
| police_per_100k_lag | 0 | 1 | 316.57 | 115.59 | 83.76 | 248.53 | 301.00 | 358.31 | 1021.14 | ▅▇▁▁▁ |
| RTC_LAW_YEAR | 0 | 1 | Inf | NaN | 1985.00 | 1995.00 | 1997.00 | 2011.00 | Inf | ▇▇▂▅▂ |
| TIME_0 | 0 | 1 | 1980.00 | 0.00 | 1980.00 | 1980.00 | 1980.00 | 1980.00 | 1980.00 | ▁▁▇▁▁ |
| TIME_INF | 0 | 1 | 2010.00 | 0.00 | 2010.00 | 2010.00 | 2010.00 | 2010.00 | 2010.00 | ▁▁▇▁▁ |
| Viol_crime_rate_1k | 0 | 1 | 5.05 | 3.19 | 0.48 | 2.85 | 4.54 | 6.44 | 29.30 | ▇▃▁▁▁ |
| Viol_crime_rate_1k_log | 0 | 1 | 1.45 | 0.60 | -0.74 | 1.05 | 1.51 | 1.86 | 3.38 | ▁▂▇▅▁ |
| Population_log | 0 | 1 | 14.99 | 1.05 | 12.91 | 14.13 | 15.07 | 15.67 | 17.44 | ▃▅▇▅▂ |
OK! We are now ready to start analyzing our data!
We have what is called panel data. This is a special type of longitudinal data. Longitudinal data are measurements taken over time. Panel data are measurements taken repeatedly on the same panel members or individuals over time. In our case, we have measurements of violent crime and other variables each year for each state; therefore we are using measurements about the same states over time.
In a panel analysis there are \(N\) individual panel members and \(T\) time points.
There are two types of panels:
1) Balanced - at each time point, there are data points for each individual.
2) Unbalanced - there may be data points missing for some individuals at some time points.
Overall, in a balanced panel we have \(n\) observations, where \(n = N \times T\). In an unbalanced panel, the number of observations is less than \(N \times T\).
In our case we have:
\(N\) = 45 states (recall that we removed states that had adopted an RTC law by our baseline year of 1980)
\(T\) = 31 years (1980 - 2010)
In every year we have measurements for each state, thus our panel is balanced.
So our total number of observations is \(n = 45 \times 31 = 1395\).
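We can verify this arithmetic directly (a quick check on the final analysis tibble):
n_states <- dplyr::n_distinct(DONOHUE_DF$STATE) # 45
n_years <- dplyr::n_distinct(DONOHUE_DF$YEAR) # 31
n_states * n_years # 1395, matching nrow(DONOHUE_DF)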
We will be performing a panel linear regression model analysis.
There are three general subtypes of panel regression analysis:
1) Independently pooled panels - assumes that there are no individual effects independent of time and also no effect of time common to all the individuals. This is essentially an ordinary least squares linear regression.
2) Fixed effects - assumes that there are aspects of the individuals that do not vary with time (i.e., are fixed over time) and that are correlated with the independent variables.
3) Random effects - assumes that the unique qualities of the individuals are not correlated with the independent variables.
See here for more information about these different models.
We will be performing a fixed effects panel regression analysis, as we think that there are unobserved qualities about the different states that may be correlated with some of our independent variables. For example, the level of economic opportunity might be an unobserved feature of the states that influences the violent crime rate and would be correlated with poverty rate and unemployment.
In such an analysis we will model our data according to this generic model:
\[Y_{it} = \alpha + \beta' X_{it} + u_{it}, \qquad u_{it} = \mu_{i} + v_{it}\]
where \(i\) is the individual dimension (in our case, individual states) and \(t\) is the time dimension.
\(\mu_{i}\) are individual effects that do not vary with time (i.e., are fixed over time), and \(v_{it}\) is the remaining error that does vary with time.
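Because \(\mu_{i}\) does not vary with time, the fixed effects ("within") estimator can eliminate it by demeaning each variable within each state (and, in a two-way model, within each year as well). A sketch of the one-way version:
\[Y_{it} - \bar{Y}_{i} = \beta'\left(X_{it} - \bar{X}_{i}\right) + \left(v_{it} - \bar{v}_{i}\right)\]
where \(\bar{Y}_{i} = \frac{1}{T}\sum_{t=1}^{T} Y_{it}\), and similarly for the other means.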
See also the related Breusch-Pagan test: https://en.wikipedia.org/wiki/Breusch%E2%80%93Pagan_test
Some code taken from http://karthur.org/2019/implementing-fixed-effects-panel-models-in-r.html
d_panel_DONOHUE <- pdata.frame(DONOHUE_DF, index=c("STATE", "YEAR"))
DONOHUE_OUTPUT <- plm(Viol_crime_rate_1k_log ~
RTC_LAW +
White_Male_15_to_19_years +
White_Male_20_to_39_years +
Black_Male_15_to_19_years +
Black_Male_20_to_39_years +
Other_Male_15_to_19_years +
Other_Male_20_to_39_years +
Unemployment_rate +
Poverty_rate +
Population_log +
police_per_100k_lag,
effect = "twoways",
model = "within",
data=d_panel_DONOHUE)
summary(DONOHUE_OUTPUT)
Twoways effects Within Model
Call:
plm(formula = Viol_crime_rate_1k_log ~ RTC_LAW + White_Male_15_to_19_years +
White_Male_20_to_39_years + Black_Male_15_to_19_years + Black_Male_20_to_39_years +
Other_Male_15_to_19_years + Other_Male_20_to_39_years + Unemployment_rate +
Poverty_rate + Population_log + police_per_100k_lag, data = d_panel_DONOHUE,
effect = "twoways", model = "within")
Balanced Panel: n = 45, T = 31, N = 1395
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-0.57957437 -0.08942194 -0.00090654 0.08673054 1.11216999
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
RTC_LAWTRUE 0.01796779 0.01663911 1.0799 0.2804066
White_Male_15_to_19_years -0.00091825 0.02724210 -0.0337 0.9731160
White_Male_20_to_39_years 0.03466473 0.00972839 3.5633 0.0003794 ***
Black_Male_15_to_19_years -0.05673593 0.05746052 -0.9874 0.3236341
Black_Male_20_to_39_years 0.12605439 0.01931450 6.5264 9.607e-11 ***
Other_Male_15_to_19_years 0.69201638 0.11322394 6.1119 1.297e-09 ***
Other_Male_20_to_39_years -0.30276797 0.03811855 -7.9428 4.226e-15 ***
Unemployment_rate -0.01685806 0.00489952 -3.4408 0.0005984 ***
Poverty_rate -0.00780235 0.00295720 -2.6384 0.0084280 **
Population_log -0.17991653 0.06041773 -2.9779 0.0029559 **
police_per_100k_lag 0.00060391 0.00013689 4.4115 1.111e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Total Sum of Squares: 43.211
Residual Sum of Squares: 36.716
R-Squared: 0.15031
Adj. R-Squared: 0.095138
F-statistic: 21.0514 on 11 and 1309 DF, p-value: < 2.22e-16
DONOHUE_OUTPUT_TIDY <- tidy(DONOHUE_OUTPUT, conf.int = TRUE, conf.level = 0.95)
DONOHUE_OUTPUT_TIDY$Analysis <- "Analysis 1"
We will also perform a test to evaluate whether there is indeed an individual effect and a time effect in our model. We can use the plmtest() function of the plm package, which performs a Lagrange Multiplier test (here the Honda version) for such effects.
DONOHUE_pool_model <- plm(Viol_crime_rate_1k_log ~
RTC_LAW +
White_Male_15_to_19_years +
White_Male_20_to_39_years +
Black_Male_15_to_19_years +
Black_Male_20_to_39_years +
Other_Male_15_to_19_years +
Other_Male_20_to_39_years +
Unemployment_rate +
Poverty_rate +
Population_log +
police_per_100k_lag,
effect = "twoways",
model = "pooling",
data=d_panel_DONOHUE)
plmtest(DONOHUE_pool_model, effect = c("twoways"))
Lagrange Multiplier Test - two-ways effects (Honda) for balanced
panels
data: Viol_crime_rate_1k_log ~ RTC_LAW + White_Male_15_to_19_years + ...
normal = 72.56, p-value < 2.2e-16
alternative hypothesis: significant effects
Note that this Lagrange multiplier statistic is not the same as the F-statistic reported in the model summary above, which tests the joint significance of the regressors. To formally compare the pooled OLS model (equivalent to a plain lm() fit) against the fixed effects model, we can use the pFtest() function of the plm package; see this Princeton tutorial for more information: http://www.princeton.edu/~otorres/Panel101R.pdf
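A sketch of that comparison, using the pooled model fit above:
# pFtest() tests for significant fixed effects by comparing the "within" model
# against the pooled model; a small p-value favors the fixed effects model.
pFtest(DONOHUE_OUTPUT, DONOHUE_pool_model)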
Some code taken from http://karthur.org/2019/implementing-fixed-effects-panel-models-in-r.html
LOTT_variables <- LOTT_DF %>%
dplyr::select(RTC_LAW,
contains(c("White","Black","Other")),
Unemployment_rate,
Poverty_rate,
Population_log,
police_per_100k_lag) %>%
colnames()
LOTT_fmla <- as.formula(paste("Viol_crime_rate_1k_log ~",
paste(LOTT_variables, collapse = " + ")
)
)
d_panel_LOTT <- pdata.frame(LOTT_DF, index=c("STATE", "YEAR"))
LOTT_OUTPUT <- plm(LOTT_fmla,
model = "within",
effect = "twoways",
data=d_panel_LOTT)
summary(LOTT_OUTPUT)
Twoways effects Within Model
Call:
plm(formula = LOTT_fmla, data = d_panel_LOTT, effect = "twoways",
model = "within")
Balanced Panel: n = 45, T = 31, N = 1395
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-0.5448906 -0.0780395 0.0026738 0.0788052 0.5847263
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
RTC_LAWTRUE -0.04687169 0.01641851 -2.8548 0.0043758 **
White_Female_10_to_19_years 0.62441376 0.15103427 4.1343 3.793e-05 ***
White_Female_20_to_29_years -0.05942541 0.06332108 -0.9385 0.3481763
White_Female_30_to_39_years 0.16028113 0.08045953 1.9921 0.0465755 *
White_Female_40_to_49_years 0.10087510 0.08170707 1.2346 0.2172082
White_Female_50_to_64_years -0.37624966 0.06303172 -5.9692 3.083e-09 ***
White_Female_65_years_and_over 0.20636690 0.04742430 4.3515 1.460e-05 ***
White_Male_10_to_19_years -0.59141591 0.14436974 -4.0965 4.457e-05 ***
White_Male_20_to_29_years 0.08717546 0.05862342 1.4870 0.1372503
White_Male_30_to_39_years -0.12514225 0.08588569 -1.4571 0.1453400
White_Male_40_to_49_years -0.21812366 0.07293615 -2.9906 0.0028375 **
White_Male_50_to_64_years 0.37845575 0.07314122 5.1743 2.653e-07 ***
White_Male_65_years_and_over -0.20915907 0.06659815 -3.1406 0.0017246 **
Black_Female_10_to_19_years -1.03146594 0.43610403 -2.3652 0.0181697 *
Black_Female_20_to_29_years -0.02721685 0.17462559 -0.1559 0.8761693
Black_Female_30_to_39_years -0.03246043 0.20498789 -0.1584 0.8742037
Black_Female_40_to_49_years 0.43820099 0.23524130 1.8628 0.0627234 .
Black_Female_50_to_64_years 0.04906111 0.21393128 0.2293 0.8186482
Black_Female_65_years_and_over 0.07226074 0.24373031 0.2965 0.7669130
Black_Male_10_to_19_years 1.22536162 0.44559642 2.7499 0.0060447 **
Black_Male_20_to_29_years -0.06587312 0.18392655 -0.3581 0.7202909
Black_Male_30_to_39_years 0.24720746 0.23673862 1.0442 0.2965804
Black_Male_40_to_49_years -0.66869983 0.27173041 -2.4609 0.0139904 *
Black_Male_50_to_64_years -0.16737616 0.23977741 -0.6980 0.4852740
Black_Male_65_years_and_over -0.58743446 0.34691532 -1.6933 0.0906404 .
Other_Female_10_to_19_years 0.70957924 0.49539878 1.4323 0.1522910
Other_Female_20_to_29_years -1.16489945 0.26997487 -4.3148 1.720e-05 ***
Other_Female_30_to_39_years -3.40258912 0.35368437 -9.6204 < 2.2e-16 ***
Other_Female_40_to_49_years 1.34563633 0.42503994 3.1659 0.0015825 **
Other_Female_50_to_64_years 2.93990932 0.33830653 8.6901 < 2.2e-16 ***
Other_Female_65_years_and_over 2.36026239 0.20422580 11.5571 < 2.2e-16 ***
Other_Male_10_to_19_years 0.07481449 0.47835310 0.1564 0.8757423
Other_Male_20_to_29_years 1.62895925 0.25740603 6.3284 3.420e-10 ***
Other_Male_30_to_39_years 3.17421278 0.41184489 7.7073 2.566e-14 ***
Other_Male_40_to_49_years -1.58494177 0.44840281 -3.5346 0.0004229 ***
Other_Male_50_to_64_years -3.91523867 0.37399898 -10.4686 < 2.2e-16 ***
Other_Male_65_years_and_over -4.16596244 0.36860536 -11.3020 < 2.2e-16 ***
Unemployment_rate -0.00545734 0.00436374 -1.2506 0.2113054
Poverty_rate -0.00572362 0.00253162 -2.2609 0.0239357 *
Population_log -0.21716335 0.08452664 -2.5692 0.0103068 *
police_per_100k_lag 0.00069547 0.00013331 5.2171 2.118e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Total Sum of Squares: 43.211
Residual Sum of Squares: 23.647
R-Squared: 0.45275
Adj. R-Squared: 0.40355
F-statistic: 25.8088 on 41 and 1279 DF, p-value: < 2.22e-16
# LOTT_OUTPUT_TIDY is assumed to be created from LOTT_OUTPUT just as DONOHUE_OUTPUT_TIDY was above
comparing_analyses <- DONOHUE_OUTPUT_TIDY %>%
bind_rows(LOTT_OUTPUT_TIDY) %>%
filter(term == "RTC_LAWTRUE")
library(grid)
comparing_analyses_plot <- ggplot(comparing_analyses) +
geom_point(aes(x = Analysis, y = estimate)) +
geom_errorbar(aes(x = Analysis, ymin = conf.low, ymax = conf.high), width = 0.25) +
geom_hline(yintercept = 0, color = "red") +
scale_y_continuous(breaks = seq(-0.2, 0.2, by = 0.05),
labels = seq(-0.2, 0.2, by = 0.05),
limits = c(-0.2,0.2)) +
geom_segment(aes(x = 1, y = 0.125, xend = 1, yend = 0.175),
arrow = arrow(angle = 45, ends = "last", type = "open"),
size = 2,
color = "green",
lineend = "butt",
linejoin = "mitre") +
geom_segment(aes(x = 2, y = -0.125, xend = 2, yend = -0.175),
arrow = arrow(angle = 45, ends = "last", type = "open"),
size = 2,
color = "red",
lineend = "butt",
linejoin = "mitre") +
theme_minimal() +
theme(axis.title.x = element_blank(),
axis.text = element_text(size = 12)) +
labs(title = "Effect estimate on ln(violent crimes per 100,000 people)",
y = "Effect estimate (95% CI)")
comparing_analyses_plot
How did the above happen?
The analysis data frames are very similar yet yielded very different results:
- different number of columns: 20 vs 50
[1] TRUE
The only difference between the two dataframes rests in how the demographic variables were parameterized.
[1] "Black_Male_15_to_19_years" "Black_Male_20_to_39_years"
[3] "Other_Male_15_to_19_years" "Other_Male_20_to_39_years"
[5] "White_Male_15_to_19_years" "White_Male_20_to_39_years"
[1] "Black_Female_10_to_19_years" "Black_Female_20_to_29_years"
[3] "Black_Female_30_to_39_years" "Black_Female_40_to_49_years"
[5] "Black_Female_50_to_64_years" "Black_Female_65_years_and_over"
[7] "Black_Male_10_to_19_years" "Black_Male_20_to_29_years"
[9] "Black_Male_30_to_39_years" "Black_Male_40_to_49_years"
[11] "Black_Male_50_to_64_years" "Black_Male_65_years_and_over"
[13] "Other_Female_10_to_19_years" "Other_Female_20_to_29_years"
[15] "Other_Female_30_to_39_years" "Other_Female_40_to_49_years"
[17] "Other_Female_50_to_64_years" "Other_Female_65_years_and_over"
[19] "Other_Male_10_to_19_years" "Other_Male_20_to_29_years"
[21] "Other_Male_30_to_39_years" "Other_Male_40_to_49_years"
[23] "Other_Male_50_to_64_years" "Other_Male_65_years_and_over"
[25] "White_Female_10_to_19_years" "White_Female_20_to_29_years"
[27] "White_Female_30_to_39_years" "White_Female_40_to_49_years"
[29] "White_Female_50_to_64_years" "White_Female_65_years_and_over"
[31] "White_Male_10_to_19_years" "White_Male_20_to_29_years"
[33] "White_Male_30_to_39_years" "White_Male_40_to_49_years"
[35] "White_Male_50_to_64_years" "White_Male_65_years_and_over"
Clearly, this had an effect on the results of the analysis.
Let's explore how this occurred.
When seemingly independent variables are highly related to one another, the relationships estimated in an analysis may be distorted.
In regression analysis, this distortion is often a byproduct of a violation of the independence assumption. This distortion, if large enough, can impact statistical inference.
There are several ways we can diagnose multicollinearity.
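One such diagnostic is the variance inflation factor (VIF), which measures how much the variance of each coefficient estimate is inflated by linear relationships with the other predictors. A minimal sketch using the car package (this fit is illustrative only, not one of the featured analyses):
library(car)
# Fit a plain linear model on a few covariates and compute VIFs; values above
# roughly 5-10 are commonly read as signs of problematic multicollinearity.
ols_fit <- lm(Viol_crime_rate_1k_log ~ Unemployment_rate + Poverty_rate +
                Population_log + police_per_100k_lag,
              data = DONOHUE_DF)
vif(ols_fit)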
Again, multicollinearity often occurs when independent variables are highly related to one another. Consequently, we can evaluate these relationships by examining the correlation between variable pairs.
It is important to note that multicollinearity and correlation are not one and the same. Correlation can be thought of as the strength of the relationship between variables. On the other hand, multicollinearity can be thought of as the violation of the independence assumption that is a consequence of this correlation in a regression analysis.
[1] "YEAR" "STATE"
[3] "Black_Male_15_to_19_years" "Black_Male_20_to_39_years"
[5] "Other_Male_15_to_19_years" "Other_Male_20_to_39_years"
[7] "White_Male_15_to_19_years" "White_Male_20_to_39_years"
[9] "Unemployment_rate" "Poverty_rate"
[11] "Viol_crime_count" "Population"
[13] "police_per_100k_lag" "RTC_LAW_YEAR"
[15] "RTC_LAW" "TIME_0"
[17] "TIME_INF" "Viol_crime_rate_1k"
[19] "Viol_crime_rate_1k_log" "Population_log"
DONOHUE_DF %>%
dplyr::select(RTC_LAW,
Viol_crime_rate_1k_log,
Unemployment_rate,
Poverty_rate,
Population_log) %>%
ggpairs(.,
columns = c(2:5),
lower = list(continuous = wrap("smooth_loess",
color = "red",
alpha = 0.5,
size = 0.1)))
LOTT_DF %>%
dplyr::select(RTC_LAW,
Viol_crime_rate_1k_log,
Unemployment_rate,
Poverty_rate,
Population_log) %>%
ggpairs(.,
columns = c(2:5),
lower = list(continuous = wrap("smooth_loess",
color = "red",
alpha = 0.5,
size = 0.1)))
cor_DONOHUE_dem <- cor(DONOHUE_DF %>% dplyr::select(contains("_years")))
corr_mat_DONOHUE <- ggcorrplot(cor_DONOHUE_dem,
tl.cex = 6,
hc.order = TRUE,
colors = c("red",
"white",
"red"),
outline.color = "transparent",
title = "Correlation Matrix, Analysis 1",
legend.title = TeX("$\\rho$"))
corr_mat_DONOHUE
cor_LOTT_dem <- cor(LOTT_DF %>% dplyr::select(contains("_years")))
corr_mat_LOTT <- ggcorrplot(cor_LOTT_dem,
tl.cex = 6,
hc.order = TRUE,
colors = c("red",
"white",
"red"),
outline.color = "transparent",
title = "Correlation Matrix, Analysis 2",
legend.title = TeX("$\\rho$"))
corr_mat_LOTT
Another symptom of multicollinearity is instability of the estimated coefficients: when predictors are highly related, small changes to the data can produce large changes in the estimates. To see this, we refit each model many times on random subsamples of the data and examine how much the RTC coefficient moves.
sims <- 250
# DONOHUE: draw `sims` subsamples of the Analysis 1 data,
# each leaving out a single randomly chosen observation
samps_DONOHUE <- lapply(rep(dim(DONOHUE_DF)[1] - 1, sims),
                        function(x) DONOHUE_DF[sample(nrow(DONOHUE_DF),
                                                      size = x, replace = FALSE), ])
# Refit the Analysis 1 two-way (state and year) fixed effects model on a given subsample
fit_nls_on_bootstrap_DONOHUE <- function(split){
plm(Viol_crime_rate_1k_log ~
RTC_LAW +
White_Male_15_to_19_years +
White_Male_20_to_39_years +
Black_Male_15_to_19_years +
Black_Male_20_to_39_years +
Other_Male_15_to_19_years +
Other_Male_20_to_39_years +
Unemployment_rate +
Poverty_rate +
Population_log +
police_per_100k_lag,
data = data.frame(split),
index = c("STATE","YEAR"),
model = "within",
effect = "twoways")
}
samps_models_DONOHUE <- lapply(samps_DONOHUE, fit_nls_on_bootstrap_DONOHUE)
samps_models_DONOHUE <- samps_models_DONOHUE %>%
map(tidy)
names(samps_models_DONOHUE) <- paste0("DONOHUE_",1:length(samps_models_DONOHUE))
simulations_DONOHUE <- samps_models_DONOHUE %>%
bind_rows(.id = "ID") %>%
mutate(Analysis = "Analysis 1")
## LOTT: draw `sims` subsamples of the Analysis 2 data,
## each keeping half of the observations
samps_LOTT <- lapply(rep(round(dim(LOTT_DF)[1] / 2), sims),
                     function(x) LOTT_DF[sample(nrow(LOTT_DF),
                                                size = x, replace = FALSE), ])
# Refit the Analysis 2 two-way (state and year) fixed effects model on a given subsample
fit_nls_on_bootstrap_LOTT <- function(split){
plm(LOTT_fmla,
data = data.frame(split),
index = c("STATE","YEAR"),
model = "within",
effect = "twoways")
}
samps_models_LOTT <- lapply(samps_LOTT, fit_nls_on_bootstrap_LOTT)
samps_models_LOTT <- samps_models_LOTT %>%
map(tidy)
names(samps_models_LOTT) <- paste0("LOTT_",1:length(samps_models_LOTT))
simulations_LOTT <- samps_models_LOTT %>%
  bind_rows(.id = "ID") %>% # use "ID" here, as in Analysis 1; mutate() below sets the Analysis label
  mutate(Analysis = "Analysis 2")
simulations <- bind_rows(simulations_DONOHUE,
simulations_LOTT)
simulation_plot <- simulations %>%
filter(term=="RTC_LAWTRUE") %>%
ggplot(aes(x = Analysis, y = estimate)) +
geom_jitter(alpha = 0.25,
width = 0.1) +
labs(title = "Coefficient instability",
subtitle = "Estimates sensitive to observation deletions",
x = "Term",
y = "Coefficient",
caption = "Results from simulations") +
theme_minimal() +
theme(axis.title.x = element_blank())
simulation_plot
A more formal diagnostic is the variance inflation factor (VIF). Because vif() from the car package does not work directly on plm fits, we rebuild the within-transformed (demeaned) data from each panel model and refit it with lm().
design.matrix <- as.data.frame(model.matrix(DONOHUE_OUTPUT))
design.matrix$Viol_crime_rate_1k_log <- plm::Within(
d_panel_DONOHUE$Viol_crime_rate_1k_log)
lm_DONOHUE <- lm(Viol_crime_rate_1k_log ~
RTC_LAWTRUE + # the logical RTC_LAW is renamed RTC_LAWTRUE in the initial model's design matrix
White_Male_15_to_19_years +
White_Male_20_to_39_years +
Black_Male_15_to_19_years +
Black_Male_20_to_39_years +
Other_Male_15_to_19_years +
Other_Male_20_to_39_years +
Unemployment_rate +
Poverty_rate +
Population_log +
police_per_100k_lag,
data = design.matrix)
vif(lm_DONOHUE)
RTC_LAWTRUE White_Male_15_to_19_years White_Male_20_to_39_years
1.097853 1.172339 1.738459
Black_Male_15_to_19_years Black_Male_20_to_39_years Other_Male_15_to_19_years
1.344193 1.653712 1.586648
Other_Male_20_to_39_years Unemployment_rate Poverty_rate
1.529688 1.244667 1.270321
Population_log police_per_100k_lag
1.153933 1.204491
vif_DONOHUE <- vif(lm_DONOHUE)
vif_DONOHUE <- tibble(VIF = vif_DONOHUE,
                      Variable = names(vif_DONOHUE))
max_vif_DONOHUE <- max(vif(lm_DONOHUE))
design.matrix <- as.data.frame(model.matrix(LOTT_OUTPUT))
design.matrix$Viol_crime_rate_1k_log <- plm::Within(
d_panel_LOTT$Viol_crime_rate_1k_log)
LOTT_variables_ols <- LOTT_DF %>%
dplyr::select(RTC_LAW,
contains(c("White","Black","Other")),
Unemployment_rate,
Poverty_rate,
Population_log,
police_per_100k_lag) %>%
colnames() %>%
str_replace("RTC_LAW", "RTC_LAWTRUE") # the logical RTC_LAW is renamed RTC_LAWTRUE in the initial model's design matrix
LOTT_fmla_ols <- as.formula(paste("Viol_crime_rate_1k_log ~",
paste(LOTT_variables_ols, collapse = " + ")
)
)
lm_LOTT <- lm(LOTT_fmla_ols,
data = design.matrix)
vif(lm_LOTT)
RTC_LAWTRUE White_Female_10_to_19_years
1.621662 127.920555
White_Female_20_to_29_years White_Female_30_to_39_years
42.269637 49.635014
White_Female_40_to_49_years White_Female_50_to_64_years
37.550101 36.451868
White_Female_65_years_and_over White_Male_10_to_19_years
12.866751 126.824984
White_Male_20_to_29_years White_Male_30_to_39_years
39.248785 73.008959
White_Male_40_to_49_years White_Male_50_to_64_years
31.613855 52.774694
White_Male_65_years_and_over Black_Female_10_to_19_years
13.285326 335.136906
Black_Female_20_to_29_years Black_Female_30_to_39_years
106.644486 79.058455
Black_Female_40_to_49_years Black_Female_50_to_64_years
98.434064 66.888057
Black_Female_65_years_and_over Black_Male_10_to_19_years
49.715869 320.740453
Black_Male_20_to_29_years Black_Male_30_to_39_years
89.297151 89.267356
Black_Male_40_to_49_years Black_Male_50_to_64_years
92.498486 64.538516
Black_Male_65_years_and_over Other_Female_10_to_19_years
37.960126 142.283700
Other_Female_20_to_29_years Other_Female_30_to_39_years
64.966861 54.511835
Other_Female_40_to_49_years Other_Female_50_to_64_years
224.567085 131.463113
Other_Female_65_years_and_over Other_Male_10_to_19_years
82.394398 151.930450
Other_Male_20_to_29_years Other_Male_30_to_39_years
54.620045 62.267344
Other_Male_40_to_49_years Other_Male_50_to_64_years
244.698473 174.184553
Other_Male_65_years_and_over Unemployment_rate
53.532299 1.497864
Poverty_rate Population_log
1.412397 3.426475
police_per_100k_lag
1.732745
vif_LOTT <- vif(lm_LOTT)
vif_LOTT <- tibble(VIF = vif_LOTT,
                   Variable = names(vif_LOTT))
max_vif_LOTT <- max(vif(lm_LOTT))
max_vif_DONOHUE
[1] 1.738459
max_vif_LOTT
[1] 335.1369
The maximum VIF in Analysis 1 is about 1.7, while the maximum VIF in Analysis 2 is over 335.
Recall the definition of the variance inflation factor for the i-th predictor:
\[VIF_{i}=\frac{1}{1-R_{i}^{2}}\]
where R_i^2 is the R-squared obtained from regressing the i-th predictor on all of the other predictors in the model. A common rule of thumb flags VIF values above 10 (the red line in the plot below) as evidence of problematic multicollinearity.
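To connect this formula to the numbers above, we can compute a VIF "by hand". This is a minimal sketch assuming design.matrix still holds the within-transformed Analysis 2 data created above:
# Regress one predictor on all of the others (excluding the response),
# then plug the resulting R-squared into 1 / (1 - R^2)
r2_unemp <- summary(lm(Unemployment_rate ~ . - Viol_crime_rate_1k_log,
                       data = design.matrix))$r.squared
1 / (1 - r2_unemp) # should match vif(lm_LOTT)["Unemployment_rate"] above (~1.50)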
vif_DONOHUE$Analysis <- "Analysis 1"
vif_LOTT$Analysis <- "Analysis 2"
vif_df <- rbind(vif_DONOHUE,
vif_LOTT)
vif_plot <- vif_df %>%
ggplot(aes(x = Analysis, y = VIF)) +
geom_jitter(width = 0.1, alpha = 0.5, size = 2) +
geom_hline(yintercept = 10, color = "red") +
scale_y_continuous(trans = 'log10',
limits = c(1,1000)) +
labs(title = "Variance inflation factors") +
theme_minimal() +
theme(axis.title.x = element_blank())
vif_plot
Tidyverse
Writing functions
Longitudinal studies
Panel data
Confidence intervals
Linear regression model
For more information on linear regression see this book and this case study.
For more on fixed effects panel models in R, see:
https://rpubs.com/rslbliss/fixed_effects
http://karthur.org/2019/implementing-fixed-effects-panel-models-in-r.html
https://stats.stackexchange.com/questions/99236/effects-in-panel-models-individual-time-or-twoways